SIMD Made Easy Through Wrappers

June 25, 2015, 7:33 am

Latest and popular articles on Intel Technologies

By Michael Kopietz, Render Architect at Crytek

1. Introduction

This article aims to change the way you think about applying SIMD programming to your code. If you think of SIMD lanes as if they were CPU threads, new ideas will come to mind and you will be able to apply SIMD more often in your code.

Intel has been producing SIMD-capable CPUs for twice as long as it has been making multi-core CPUs, yet the threading model is far more established in software development. One reason is the abundance of guides that present threading in a simple way, as if it were only a matter of running an entry function n times, and leave out all the possible complications. SIMD guides, on the other hand, tend to focus on squeezing out the final 10% of speedup, which requires doubling the size of the code. When these guides contain examples, it is hard to take in all the new information and at the same time work out how to use it simply and elegantly. Showing a simple and useful way of using SIMD is therefore the main goal of this article.

First, let us state the basic principle of SIMD code: alignment. Probably all SIMD hardware requires, or at least prefers, some natural alignment, and explaining the basics of it would take quite a few pages [1]. In general, though, unless you are running out of memory, it is important to allocate memory in a cache-friendly way. For Intel CPUs that means allocating memory on a 64-byte boundary, as shown in code snippet 1.

inline void* operator new(size_t size)
{
	return _mm_malloc(size, 64);
}

inline void* operator new[](size_t size)
{
	return _mm_malloc(size, 64);
}

inline void operator delete(void *mem)
{
	_mm_free(mem);
}

inline void operator delete[](void *mem)
{
	_mm_free(mem);
}

Code snippet 1: Allocation functions that respect 64-byte boundaries so that cache efficiency does not suffer.
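
As a minimal sketch of the same idea without replacing the global operator new, the helpers below allocate a float buffer on a 64-byte boundary with the standard C++17 function std::aligned_alloc and check the result. The names IsCacheLineAligned and AllocFloats64 are hypothetical and not part of the article's source. Rounding the size up to a multiple of 64 is an assumption; it also leaves some trailing padding, which SIMD loops that process whole registers tend to rely on.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: true if p lies on a 64-byte (cache line) boundary.
inline bool IsCacheLineAligned(const void* p)
{
	return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// Hypothetical helper: allocates room for at least Count floats on a
// 64-byte boundary. std::aligned_alloc expects the size to be a multiple
// of the alignment, so the byte count is rounded up to the next 64.
// Release the memory with std::free.
inline float* AllocFloats64(std::size_t Count)
{
	const std::size_t Bytes = (Count * sizeof(float) + 63) / 64 * 64;
	return static_cast<float*>(std::aligned_alloc(64, Bytes == 0 ? 64 : Bytes));
}
```

A caller would use it like `float* p = AllocFloats64(100);` and later `std::free(p);`. On toolchains without std::aligned_alloc, _mm_malloc and _mm_free from code snippet 1 serve the same purpose.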

2. The basic idea

The way to start is simple: assume that every lane of a SIMD register executes as a thread. With Intel® Streaming SIMD Extensions (Intel® SSE) you have 4 threads/lanes of 32-bit elements, with Intel® Advanced Vector Extensions (Intel® AVX) you have 8, and on Intel® Xeon Phi™ coprocessors you have 16.

To get a drop-in solution, the first step is to implement classes that behave mostly like primitive data types. Wrap "int", "float", and so on, and use those wrappers as the starting point for every SIMD implementation. For the Intel SSE version, replace float with __m128 and int and unsigned int with __m128i, and implement the operators with Intel SSE or Intel AVX intrinsics, as in code snippet 2.

// SSE 128-bit
inline	DRealF	operator+(DRealF R)const{return DRealF(_mm_add_ps(m_V, R.m_V));}
inline	DRealF	operator-(DRealF R)const{return DRealF(_mm_sub_ps(m_V, R.m_V));}
inline	DRealF	operator*(DRealF R)const{return DRealF(_mm_mul_ps(m_V, R.m_V));}
inline	DRealF	operator/(DRealF R)const{return DRealF(_mm_div_ps(m_V, R.m_V));}

// AVX 256-bit
inline	DRealF	operator+(const DRealF& R)const{return DRealF(_mm256_add_ps(m_V, R.m_V));}
inline	DRealF	operator-(const DRealF& R)const{return DRealF(_mm256_sub_ps(m_V, R.m_V));}
inline	DRealF	operator*(const DRealF& R)const{return DRealF(_mm256_mul_ps(m_V, R.m_V));}
inline	DRealF	operator/(const DRealF& R)const{return DRealF(_mm256_div_ps(m_V, R.m_V));}

Code snippet 2: Overloaded arithmetic operators for SIMD wrappers
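
To make the fragment above concrete, here is a minimal sketch of what a complete float wrapper might look like. Only the operator bodies come from the article; the constructors, the Lane accessor, and the scalar fallback selected through the __SSE2__ macro are assumptions added for illustration.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>

// SSE path: one wrapper holds four float lanes.
constexpr std::size_t THREAD_COUNT = 4;
struct DRealF
{
	__m128 m_V;
	DRealF() : m_V(_mm_setzero_ps()) {}
	explicit DRealF(__m128 V) : m_V(V) {}
	// Replicates one scalar to every lane.
	explicit DRealF(float F) : m_V(_mm_set1_ps(F)) {}
	DRealF operator+(DRealF R) const { return DRealF(_mm_add_ps(m_V, R.m_V)); }
	DRealF operator-(DRealF R) const { return DRealF(_mm_sub_ps(m_V, R.m_V)); }
	DRealF operator*(DRealF R) const { return DRealF(_mm_mul_ps(m_V, R.m_V)); }
	DRealF operator/(DRealF R) const { return DRealF(_mm_div_ps(m_V, R.m_V)); }
	// Hypothetical accessor for inspecting a single lane.
	float Lane(std::size_t i) const
	{
		alignas(16) float T[4];
		_mm_store_ps(T, m_V);
		return T[i];
	}
};
#else
// Scalar fallback: the same interface with a single lane.
constexpr std::size_t THREAD_COUNT = 1;
struct DRealF
{
	float m_V;
	DRealF() : m_V(0.f) {}
	explicit DRealF(float F) : m_V(F) {}
	DRealF operator+(DRealF R) const { return DRealF(m_V + R.m_V); }
	DRealF operator-(DRealF R) const { return DRealF(m_V - R.m_V); }
	DRealF operator*(DRealF R) const { return DRealF(m_V * R.m_V); }
	DRealF operator/(DRealF R) const { return DRealF(m_V / R.m_V); }
	float Lane(std::size_t) const { return m_V; }
};
#endif
```

With either definition, an expression such as `DRealF(1.5f) * DRealF(2.0f) + DRealF(1.5f)` reads like scalar code and evaluates the same arithmetic in every lane.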

3. Usage example

Now let us assume we are working on two HDR images in which every pixel is a float, and we blend between the two images.

void CrossFade(float* pOut,const float* pInA,const float* pInB,size_t PixelCount,float Factor)
{
	const DRealF BlendA(1.f - Factor);
	const DRealF BlendB(Factor);
	for(size_t i = 0; i < PixelCount; i += THREAD_COUNT)
		*(DRealF*)(pOut + i) = *(DRealF*)(pInA + i) * BlendA + *(DRealF*)(pInB + i) * BlendB;
}

Code snippet 3: Blend function that works with primitive data types as well as with SIMD data.

The same source in code snippet 3 runs natively on normal registers as well as on Intel SSE and Intel AVX, depending on which wrapper implementation it is built with. It is not really the conventional way you would write it, but every C++ programmer should be able to read and understand it. Let us check that it does what it appears to. The first and second lines of the implementation initialize the blend factors of our linear interpolation by replicating the parameter to whatever width the SIMD register has.

The third line is almost a normal loop. The only unusual thing is "THREAD_COUNT". It is 1 for normal registers, 4 for Intel SSE, and 8 for Intel AVX: the lane count of the register, which in our case resembles threads.

The fourth line indexes into the arrays; both input pixels are scaled by their blend factors and summed. Depending on your writing preference, you may want to use temporaries, but there are no intrinsics to look up and no per-platform implementation.
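
Code snippet 3 assumes that the buffers are suitably aligned and padded to a multiple of THREAD_COUNT. When that cannot be guaranteed, one possible variation, sketched here directly with intrinsics rather than with the article's wrappers, uses unaligned loads and a scalar loop for the leftover pixels. The function name CrossFadeUnaligned and the __SSE2__ fallback check are assumptions.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Blends two float images: pOut = pInA * (1 - Factor) + pInB * Factor.
// Works for any pointer alignment and any PixelCount.
inline void CrossFadeUnaligned(float* pOut, const float* pInA, const float* pInB,
	std::size_t PixelCount, float Factor)
{
	std::size_t i = 0;
#if defined(__SSE2__)
	const __m128 BlendA = _mm_set1_ps(1.f - Factor);
	const __m128 BlendB = _mm_set1_ps(Factor);
	// Main part: four pixels per iteration.
	for(; i + 4 <= PixelCount; i += 4)
	{
		const __m128 A = _mm_mul_ps(_mm_loadu_ps(pInA + i), BlendA);
		const __m128 B = _mm_mul_ps(_mm_loadu_ps(pInB + i), BlendB);
		_mm_storeu_ps(pOut + i, _mm_add_ps(A, B));
	}
#endif
	// Tail (or everything, when SSE is unavailable): one pixel per iteration.
	for(; i < PixelCount; ++i)
		pOut[i] = pInA[i] * (1.f - Factor) + pInB[i] * Factor;
}
```

The trade-off is a slightly longer function in exchange for not having to control how the caller allocated the images.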

4. The moment of truth

Now it is time to prove that it works. Let us take a conventional MD5 hash implementation and use all the available CPU power to search for a pre-image. To do so, we replace the primitive types with our SIMD types. MD5 runs several "rounds" that apply various simple bit operations on unsigned integers, as shown in code snippet 4.

#define LEFTROTATE(x, c) (((x) << (c)) | ((x) >> (32 - (c))))
#define BLEND(a, b, x) SelectBit(a, b, x)

template<int r>
inline DRealU Step1(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	const DRealU f = BLEND(d, c, b);
	return b + LEFTROTATE((a + f + k + w), r);
}

template<int r>
inline DRealU Step2(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	const DRealU f = BLEND(c, b, d);
	return b + LEFTROTATE((a + f + k + w),r);
}

template<int r>
inline DRealU Step3(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	DRealU f = b ^ c ^ d;
	return b + LEFTROTATE((a + f + k + w), r);
}

template<int r>
inline DRealU Step4(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	DRealU f = c ^ (b | (~d));
	return b + LEFTROTATE((a + f + k + w), r);
}

Code snippet 4: MD5 step functions for SIMD wrappers

Besides the type names, there is only one change that might look a bit like magic: "SelectBit". If a bit of x is set, the corresponding bit of b is returned; otherwise, the corresponding bit of a. In other words, a blend. Code snippet 5 shows the main MD5 hash function.
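
The article does not list SelectBit itself. A plausible scalar version, given here as an assumption, needs only AND, OR, and NOT; a SIMD version would presumably apply the same three operations to whole registers. MD5F is a hypothetical reference function, included to show that BLEND(d, c, b) in Step1 matches the usual definition of the MD5 round function F.

```cpp
#include <cstdint>

// Scalar SelectBit: for every bit position, take the bit from b where the
// corresponding bit of x is set, and from a where it is cleared.
inline std::uint32_t SelectBit(std::uint32_t a, std::uint32_t b, std::uint32_t x)
{
	return (a & ~x) | (b & x);
}

// MD5 round function F written the textbook way, for comparison only.
inline std::uint32_t MD5F(std::uint32_t b, std::uint32_t c, std::uint32_t d)
{
	return (b & c) | (~b & d);
}
```

For any inputs, `SelectBit(d, c, b)` and `MD5F(b, c, d)` produce the same value, which is why the step functions can be expressed through a single blend.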

inline void MD5(const uint8_t* pMSG,DRealU& h0,DRealU& h1,DRealU& h2,DRealU& h3,uint32_t Offset)
{
	const DRealU w0  =	Offset(DRealU(*reinterpret_cast<const uint32_t*>(pMSG + 0 * 4) + Offset));
	const DRealU w1  =	*reinterpret_cast<const uint32_t*>(pMSG + 1 * 4);
	const DRealU w2  =	*reinterpret_cast<const uint32_t*>(pMSG + 2 * 4);
	const DRealU w3  =	*reinterpret_cast<const uint32_t*>(pMSG + 3 * 4);
	const DRealU w4  =	*reinterpret_cast<const uint32_t*>(pMSG + 4 * 4);
	const DRealU w5  =	*reinterpret_cast<const uint32_t*>(pMSG + 5 * 4);
	const DRealU w6  =	*reinterpret_cast<const uint32_t*>(pMSG + 6 * 4);
	const DRealU w7  =	*reinterpret_cast<const uint32_t*>(pMSG + 7 * 4);
	const DRealU w8  =	*reinterpret_cast<const uint32_t*>(pMSG + 8 * 4);
	const DRealU w9  =	*reinterpret_cast<const uint32_t*>(pMSG + 9 * 4);
	const DRealU w10 =	*reinterpret_cast<const uint32_t*>(pMSG + 10 * 4);
	const DRealU w11 =	*reinterpret_cast<const uint32_t*>(pMSG + 11 * 4);
	const DRealU w12 =	*reinterpret_cast<const uint32_t*>(pMSG + 12 * 4);
	const DRealU w13 =	*reinterpret_cast<const uint32_t*>(pMSG + 13 * 4);
	const DRealU w14 =	*reinterpret_cast<const uint32_t*>(pMSG + 14 * 4);
	const DRealU w15 =	*reinterpret_cast<const uint32_t*>(pMSG + 15 * 4);

	DRealU a = h0;
	DRealU b = h1;
	DRealU c = h2;
	DRealU d = h3;

	a = Step1< 7>(a, b, c, d, k0, w0);
	d = Step1<12>(d, a, b, c, k1, w1);
	.
	.
	.
	d = Step4<10>(d, a, b, c, k61, w11);
	c = Step4<15>(c, d, a, b, k62, w2);
	b = Step4<21>(b, c, d, a, k63, w9);

	h0 += a;
	h1 += b;
	h2 += c;
	h3 += d;
}

Code snippet 5: The main MD5 function

Most of the code again looks like a normal C function, except that the first lines prepare the data by replicating the passed parameter across our SIMD registers. In this case, we load the SIMD registers with the data we want to hash. One specialty is the "Offset" call, since we do not want every SIMD lane to do exactly the same work. This call offsets the register by the lane index. It is like adding a thread ID. See code snippet 6.

Offset(Register)
{
	for(i = 0; i < THREAD_COUNT; i++)
		Register[i] += i;
}

Code snippet 6: Offset is a function to deal with different register widths.

That means the first element we hash is not [0, 0, 0, 0] for Intel SSE or [0, 0, 0, 0, 0, 0, 0, 0] for Intel AVX. It is [0, 1, 2, 3] and [0, 1, 2, 3, 4, 5, 6, 7], respectively. This mimics the effect of running the function in parallel on 4 or 8 threads/cores, but in the SIMD case the parallelism happens at the instruction level.
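
Code snippet 6 is pseudocode. For the four-lane case, the per-lane offset could be written as a single integer add, as in the following sketch. The function name OffsetLanes, the std::array return type, and the scalar fallback are assumptions chosen to keep the example self-contained; the article's real Offset operates on its DRealU wrapper.

```cpp
#include <array>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Returns four lanes that all start at Base and are then offset by their
// lane index, i.e. [Base, Base + 1, Base + 2, Base + 3].
inline std::array<std::uint32_t, 4> OffsetLanes(std::uint32_t Base)
{
	std::array<std::uint32_t, 4> Lanes;
#if defined(__SSE2__)
	// _mm_set_epi32 takes the highest lane first, so lane 0 receives 0.
	const __m128i Index = _mm_set_epi32(3, 2, 1, 0);
	const __m128i Value = _mm_add_epi32(_mm_set1_epi32(static_cast<int>(Base)), Index);
	_mm_storeu_si128(reinterpret_cast<__m128i*>(Lanes.data()), Value);
#else
	for(std::uint32_t i = 0; i < 4; i++)
		Lanes[i] = Base + i;
#endif
	return Lanes;
}
```

Calling `OffsetLanes(0)` yields the lanes 0, 1, 2, 3 described above, so each lane hashes a different candidate.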

Table 1 shows the results of our 10 minutes of hard work to convert this function to SIMD.

Table 1: MD5 performance with primitive and SIMD types

Type	Time	Speedup
x86 integer	379.389s	1.0x
SSE4	108.108s	3.5x
AVX2	51.490s	7.4x

5. Beyond simple SIMD threads

The results are good, though they do not scale linearly, because there is always a non-threaded part (it is easy to spot in the provided source code). But we are not aiming for the last 10% with twice the work. As programmers, we prefer other quick solutions that maximize the gain. Some questions always come up, such as: would it be worth unrolling the loop?

MD5 hashing appears to depend frequently on the result of previous operations, which is not friendly to CPU pipelines, but we could become register bound if we unroll. Our wrappers let us evaluate that easily. Unrolling is the software version of hyper-threading: we emulate twice as many running threads by repeating every operation on twice as much data as there are SIMD lanes. So we create a similar duplicated type and unroll inside it by duplicating every operation of our basic operators, as in code snippet 7.

struct __m1282
{
	__m128		m_V0;
	__m128		m_V1;
	inline		__m1282(){}
	inline		__m1282(__m128 C0, __m128 C1):m_V0(C0), m_V1(C1){}
};

inline	DRealF	operator+(DRealF R)const
	{return __m1282(_mm_add_ps(m_V.m_V0, R.m_V.m_V0),_mm_add_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator-(DRealF R)const
	{return __m1282(_mm_sub_ps(m_V.m_V0, R.m_V.m_V0),_mm_sub_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator*(DRealF R)const
	{return __m1282(_mm_mul_ps(m_V.m_V0, R.m_V.m_V0),_mm_mul_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator/(DRealF R)const
	{return __m1282(_mm_div_ps(m_V.m_V0, R.m_V.m_V0),_mm_div_ps(m_V.m_V1, R.m_V.m_V1));}

Code snippet 7: These operators are re-implemented to work with two SSE registers at the same time
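
The doubled type does not have to be written by hand for every register type. One possible generalization, which is not taken from the article, is a small template that pairs two values of any wrapper or primitive type and forwards each operator to both halves. The name Unrolled2 is hypothetical.

```cpp
// Hypothetical generic variant of the doubled type: it holds two values of
// type T and applies every arithmetic operator to both halves. The two
// halves are independent, which gives the CPU more work to overlap.
template<typename T>
struct Unrolled2
{
	T m_V0;
	T m_V1;
	Unrolled2 operator+(const Unrolled2& R) const { return {m_V0 + R.m_V0, m_V1 + R.m_V1}; }
	Unrolled2 operator-(const Unrolled2& R) const { return {m_V0 - R.m_V0, m_V1 - R.m_V1}; }
	Unrolled2 operator*(const Unrolled2& R) const { return {m_V0 * R.m_V0, m_V1 * R.m_V1}; }
	Unrolled2 operator/(const Unrolled2& R) const { return {m_V0 / R.m_V0, m_V1 / R.m_V1}; }
};
```

With T set to float this behaves like two scalar lanes; with T set to an SSE or AVX wrapper type that provides the same operators, it would double the number of emulated threads in the same way as code snippet 7.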

That's it. Now we can rerun the timings of the MD5 hash function.

Table 2: MD5 performance with SIMD types and loop unrolling

Type	Time	Speedup
x86 integer	379.389s	1.0x
SSE4	108.108s	3.5x
SSE4 x2	75.659s	5.0x
AVX2	51.490s	7.4x
AVX2 x2	36.014s	10.5x

The data in Table 2 clearly shows that unrolling is worth it. We achieve speed beyond the scaling of the SIMD lane count, probably because the x86 integer version was already stalling the pipeline with operation dependencies.

6. More complex SIMD threads

So far our examples were simple in the sense that the code was the usual candidate for hand vectorization. There was nothing complex beyond a lot of compute-demanding operations. But what would we do in more complex situations, such as branching?

The solution is again quite simple and widely used: speculative calculation and masking. Anyone who has worked with shaders or compute languages will have run into this before. Let us look at the basic branch in code snippet 8 and rewrite it with the ?: operator, as in code snippet 9.

int a = 0;
if(i % 2 == 1)
	a = 1;
else
	a = 3;

Code snippet 8: Uses if-else to calculate the result

int a = (i % 2) ? 1 : 3;

Code snippet 9: Uses the ternary operator ?: to calculate the result.

We can also use the bit-select operator from code snippet 4 and achieve the same with bit operations only, as in code snippet 10.

int Mask = (i % 2) ? ~0 : 0;
int a = SelectBit(3, 1, Mask);

Code snippet 10: Using SelectBit prepares the code for SIMD registers as data

That might seem pointless if we still need a ?: operator to create the mask, and the comparison does not give a true-or-false result but bits that are set or cleared. That is not a problem, however, because all bits set or all bits cleared in each lane is exactly what the Intel SSE and Intel AVX comparison instructions return.
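
As an illustration of that last point, the following sketch evaluates the odd/even selection from code snippet 10 for four consecutive indices at once. The function name OddOneEvenThree and the scalar fallback are assumptions; the SSE2 intrinsics used are standard ones.

```cpp
#include <array>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// For the four consecutive indices First .. First + 3, returns 1 where the
// index is odd and 3 where it is even, without a branch per element.
inline std::array<std::int32_t, 4> OddOneEvenThree(std::int32_t First)
{
	std::array<std::int32_t, 4> Out;
#if defined(__SSE2__)
	const __m128i Index = _mm_add_epi32(_mm_set1_epi32(First), _mm_set_epi32(3, 2, 1, 0));
	const __m128i One = _mm_set1_epi32(1);
	// The comparison yields 0xFFFFFFFF in lanes where it holds and 0 elsewhere.
	const __m128i Mask = _mm_cmpeq_epi32(_mm_and_si128(Index, One), One);
	// SelectBit(3, 1, Mask) expressed with ANDNOT, AND and OR.
	const __m128i Result = _mm_or_si128(
		_mm_andnot_si128(Mask, _mm_set1_epi32(3)),
		_mm_and_si128(Mask, One));
	_mm_storeu_si128(reinterpret_cast<__m128i*>(Out.data()), Result);
#else
	for(std::int32_t i = 0; i < 4; i++)
	{
		const std::int32_t Mask = ((First + i) & 1) ? ~0 : 0;
		Out[i] = (3 & ~Mask) | (1 & Mask);
	}
#endif
	return Out;
}
```

No ?: operator is needed in the SSE path: the comparison instruction produces the mask directly, and the selection consumes it as ordinary data.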

Of course, instead of assigning just 3 or 1, you can call functions and select whichever returned result you want. That may improve performance even in non-vectorized code, because you avoid the branch and the CPU never suffers a misprediction on it, although the more complex the functions you call, the more likely it is that they contain branches of their own that can be mispredicted. Even in vectorized code, we want to avoid executing long branches unnecessarily. The way to do that is to check for the special cases in which all elements of our SIMD register have the same comparison result, as shown in code snippet 11.

int Mask = (i % 2) ? ~0 : 0;
int a = 0;
if(All(Mask))
	a = Function1();
else
if(None(Mask))
	a = Function3();
else
	a = SelectBit(Function3(), Function1(), Mask);

Code snippet 11: A branchless, optimized selection between two functions

This detects the special cases in which all elements are "true" or all are "false". Those cases run on SIMD the same way as on x86. The execution flow diverges only in the last "else", so that is where we have to use bit selection.

If Function1 or Function3 modify any data, you have to pass the mask down the call and explicitly bit-select the modifications, just as we have done in this section. For a drop-in solution it is a fair amount of work, but the resulting code is readable to most programmers.
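
The article does not show how All and None are implemented. One common way to build them on SSE2, offered here as an assumption rather than as the author's code, is to collapse the mask register into an integer with _mm_movemask_epi8. The DMask4 type and the MakeMask helper are hypothetical and exist only to make the sketch self-contained.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>

// A mask is four lanes that are each either 0xFFFFFFFF or 0.
typedef __m128i DMask4;
inline DMask4 MakeMask(bool L0, bool L1, bool L2, bool L3)
{
	return _mm_set_epi32(L3 ? -1 : 0, L2 ? -1 : 0, L1 ? -1 : 0, L0 ? -1 : 0);
}
// _mm_movemask_epi8 gathers the top bit of each of the 16 bytes into an int.
inline bool All(DMask4 M)  { return _mm_movemask_epi8(M) == 0xFFFF; }
inline bool None(DMask4 M) { return _mm_movemask_epi8(M) == 0; }
#else
// Scalar fallback with the same interface.
struct DMask4 { std::uint32_t L[4]; };
inline DMask4 MakeMask(bool L0, bool L1, bool L2, bool L3)
{
	return DMask4{{L0 ? ~0u : 0u, L1 ? ~0u : 0u, L2 ? ~0u : 0u, L3 ? ~0u : 0u}};
}
inline bool All(DMask4 M)  { return (M.L[0] & M.L[1] & M.L[2] & M.L[3]) == ~0u; }
inline bool None(DMask4 M) { return (M.L[0] | M.L[1] | M.L[2] | M.L[3]) == 0u; }
#endif
```

A mixed mask makes both All and None return false, which is the case that falls through to the bit-select branch in code snippet 11.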

7. Code example

Let us again take some source code and drop in our SIMD types. A particularly interesting case is raycasting of distance fields. We will use the scene from Iñigo Quilez's demo [2], with his kind permission. The image is shown in Figure 1.

Figure 1: Test scene from Iñigo Quilez's raycasting demo.

The "SIMD threading" is placed where you would add threading. Every thread handles one pixel and traverses the scene until it hits something. Then a little shading is applied, and the pixel is converted to RGBA and written to the frame buffer.

The scene traversal is done iteratively. Every ray takes an unpredictable number of steps until a hit is recognized. For example, a wall in the foreground is reached after a few steps, whereas some rays travel the maximum trace distance without hitting anything. The main loop in code snippet 12 handles both cases, using the bit-select method discussed in the previous section.

DRealU LoopMask(RTrue);
for(; a < 128; a++)
{
      DRealF Dist             =     SceneDist(O.x, O.y, O.z, C);
      DRealU DistU            =     *reinterpret_cast<DRealU*>(&Dist) & DMask(LoopMask);
      Dist                    =     *reinterpret_cast<DRealF*>(&DistU);
      TotalDist               =     TotalDist + Dist;
      O                       +=    D * Dist;
      LoopMask                =     LoopMask && Dist > MinDist && TotalDist < MaxDist;
      if(DNone(LoopMask))
            break;
}

Code snippet 12: Raycasting with SIMD types

The LoopMask variable marks each ray with ~0 if it is still active or with 0 if we are done with it. At the end of the loop, we check whether any rays are still active, and if none are, we break out of the loop.

In the line above that, we evaluate our conditions for the rays: whether we are close enough to an object to call it a hit, or whether the ray has already gone beyond the maximum distance we want to trace. We logically AND this with the previous result, since the ray might already have been terminated in one of the previous iterations.

"SceneDist" is the evaluation function for the tracing. It runs for all SIMD lanes and is the heavyweight function that returns the current distance to the closest object. The next line sets the distance to 0 for rays that are no longer active, and the following lines advance the rays by this amount for the next iteration.
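
To see the masking at work without any SIMD hardware, the loop can be emulated lane by lane in plain C++. This is a sketch under strong simplifications: ToySceneDist is a made-up one-dimensional stand-in for SceneDist, all rays travel along +x toward a wall at x = 16, and the lane count is fixed at 4. None of these names come from the article's source.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t LANES = 4;

// Bitwise AND of a float with a lane mask (~0 keeps the value, 0 gives 0.0f).
inline float MaskFloat(float Value, std::uint32_t Mask)
{
	std::uint32_t Bits;
	std::memcpy(&Bits, &Value, sizeof(Bits));
	Bits &= Mask;
	std::memcpy(&Value, &Bits, sizeof(Value));
	return Value;
}

// Toy 1D scene: a deliberately conservative distance to a wall at x = 16.
inline float ToySceneDist(float O) { return 0.5f * (16.f - O); }

// Marches all lanes along +x and returns the number of loop iterations run.
inline int MarchLanes(float (&O)[LANES], float MinDist, float MaxDist)
{
	std::uint32_t LoopMask[LANES];
	float TotalDist[LANES];
	for(std::size_t l = 0; l < LANES; l++) { LoopMask[l] = ~0u; TotalDist[l] = 0.f; }

	int Iterations = 0;
	for(int a = 0; a < 128; a++)
	{
		Iterations++;
		std::uint32_t AnyActive = 0;
		for(std::size_t l = 0; l < LANES; l++)
		{
			// Finished lanes get a step of 0, so their state stops changing.
			const float Dist = MaskFloat(ToySceneDist(O[l]), LoopMask[l]);
			TotalDist[l] += Dist;
			O[l] += Dist;
			const bool Active = LoopMask[l] && Dist > MinDist && TotalDist[l] < MaxDist;
			LoopMask[l] = Active ? ~0u : 0u;
			AnyActive |= LoopMask[l];
		}
		if(AnyActive == 0)	// same role as DNone(LoopMask)
			break;
	}
	return Iterations;
}
```

Lanes that start close to the wall finish after a few steps and then idle with a zero step, while the loop keeps running until the farthest lane is done. That idle time is the divergence cost that shows up in the measurements.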

The original "SceneDist" had some assembler optimizations and material handling that we do not need in our test. The function is reduced to the minimum we need for a complex example. It still contains some "if" statements, which are handled the same way as before. Overall, "SceneDist" is fairly large and complex. It would take a long time to rewrite it by hand for every SIMD platform again and again. You would have to convert it all in one go, and a few typos could make the results wrong. And even if it worked, you would end up with only a few functions that you really understand, and the maintenance effort would be much higher. Doing it by hand should be the last resort. Compared to that, our changes are relatively small. The code is easy to modify, and you can extend the visual appearance without worrying about optimizing it again or being the only one who understands the code, just as if we had added real threads.

The work we did was to see results, so let us look at the timings in Table 3.

Table 3: Raycasting performance with primitive and SIMD types, including loop-unrolled types.

Type	FPS	Speedup
x86	0.992FPS	1.0x
SSE4	3.744FPS	3.8x
SSE4 x2	3.282FPS	3.3x
AVX2	6.960FPS	7.0x
AVX2 x2	5.947FPS	6.0x

You can clearly see that the speedup does not scale linearly with the number of elements, which is mainly due to divergence. Some rays may need 10 times more iterations than others.

8. Why not let the compiler do it?

Today's compilers can vectorize to some degree, but the top priority for generated code is correct results, since nobody would use binaries that are 100 times faster if they delivered wrong results, even if only 1% of the time. Some of our assumptions, such as that the data is aligned for SIMD and that we allocate enough padding not to overwrite consecutive allocations, are beyond what the compiler can know. You can get annotations from the Intel compiler about all the opportunities it had to skip because of assumptions it could not guarantee, and then try to rearrange the code and make promises to the compiler so that it generates the vectorized version. But you would have to redo that work every time you modify the code, and in more complex cases, such as branching, you can only guess whether the result will be serialized code or branchless bit selection.
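
As an illustration of what such promises can look like, here is a scalar version of the blend from code snippet 3 with two of them attached. This is a sketch, not the article's code: __restrict is a widely supported compiler extension rather than standard C++, __builtin_assume_aligned is specific to GCC and Clang, and the function name CrossFadePromised is hypothetical. Whether a given compiler actually vectorizes the loop still depends on the compiler and its options.

```cpp
#include <cstddef>

// __restrict promises the compiler that the three buffers do not overlap.
inline void CrossFadePromised(float* __restrict pOut, const float* __restrict pInA,
	const float* __restrict pInB, std::size_t PixelCount, float Factor)
{
#if defined(__GNUC__)
	// GCC/Clang builtin: promise that the pointers are 64-byte aligned.
	// Breaking this promise is undefined behavior, so the caller must
	// really allocate the buffers on 64-byte boundaries.
	pOut = static_cast<float*>(__builtin_assume_aligned(pOut, 64));
	pInA = static_cast<const float*>(__builtin_assume_aligned(pInA, 64));
	pInB = static_cast<const float*>(__builtin_assume_aligned(pInB, 64));
#endif
	// A plain scalar loop; with the promises above the compiler is free to
	// vectorize it, but nothing here forces it to.
	for(std::size_t i = 0; i < PixelCount; ++i)
		pOut[i] = pInA[i] * (1.f - Factor) + pInB[i] * Factor;
}
```

The promises live outside the type system, so every later change to the calling code has to keep them true by hand, which is the maintenance burden described above.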

Besides, the compiler has no idea what you intend to create. You know whether the threads will diverge or be coherent, and implement a branched or a bit-selecting solution accordingly. You also see the point of attack, the loop that makes the most sense to convert to SIMD, whereas the compiler can only guess whether it will iterate ten times or a million times.

Relying on the compiler for vectorization is a win in some places and a loss in others. It is good to have this alternative, just as you have hand-placed thread entry points.

9. Real threading?

Yes, real threading is useful, and SIMD threads are not a replacement; the two are orthogonal. SIMD threads are still not as simple to get running as real threads, but they cause fewer synchronization problems and rarely produce bugs. The big advantage is that every core Intel sells can run the SIMD-threaded version with all of its "threads". A dual-core CPU will run 4 or 8 times faster, just like a quad-socket, 15-core Haswell-EP. Tables 4 to 7 summarize some results of our benchmarks in combination with threading.

Table 4: MD5 performance on Intel® Core™ i7 4770K with SIMD and threading

Threads	Type	Time	Speedup
1T	x86 integer	311.704s	1.00x
8T	x86 integer	47.032s	6.63x
1T	SSE4	90.601s	3.44x
8T	SSE4	14.965s	20.83x
1T	SSE4 x2	62.225s	5.01x
8T	SSE4 x2	12.203s	25.54x
1T	AVX2	42.071s	7.41x
8T	AVX2	6.474s	48.15x
1T	AVX2 x2	29.612s	10.53x
8T	AVX2 x2	5.616s	55.50x

Table 5: Raycasting performance on Intel® Core™ i7 4770K with SIMD and threading

Threads	Type	FPS	Speedup
1T	x86 integer	1.202FPS	1.00x
8T	x86 integer	6.019FPS	5.01x
1T	SSE4	4.674FPS	3.89x
8T	SSE4	23.298FPS	19.38x
1T	SSE4 x2	4.053FPS	3.37x
8T	SSE4 x2	20.537FPS	17.09x
1T	AVX2	8.646FPS	7.19x
8T	AVX2	42.444FPS	35.31x
1T	AVX2 x2	7.291FPS	6.07x
8T	AVX2 x2	36.776FPS	30.60x

Table 6: MD5 performance on Intel® Core™ i7 5960X with SIMD and threading

Threads	Type	Time	Speedup
1T	x86 integer	379.389s	1.00x
16T	x86 integer	28.499s	13.34x
1T	SSE4	108.108s	3.51x
16T	SSE4	9.194s	41.26x
1T	SSE4 x2	75.694s	5.01x
16T	SSE4 x2	7.381s	51.40x
1T	AVX2	51.490s	7.37x
16T	AVX2	3.965s	95.68x
1T	AVX2 x2	36.015s	10.53x
16T	AVX2 x2	3.387s	112.01x

Table 7: Raycasting performance on Intel® Core™ i7 5960X with SIMD and threading

Threads	Type	FPS	Speedup
1T	x86 integer	0.992FPS	1.00x
16T	x86 integer	6.813FPS	6.87x
1T	SSE4	3.744FPS	3.77x
16T	SSE4	37.927FPS	38.23x
1T	SSE4 x2	3.282FPS	3.31x
16T	SSE4 x2	33.770FPS	34.04x
1T	AVX2	6.960FPS	7.02x
16T	AVX2	70.545FPS	71.11x
1T	AVX2 x2	5.947FPS	6.00x
16T	AVX2 x2	59.252FPS	59.76x

¹ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

As you can see, the results vary depending on the CPU; the SIMD-thread results scale in a similar way. It is striking that speedup factors of more than 30 are reached when both ideas are combined. It makes sense to go for the eightfold speedup on a dual-core CPU, but it makes just as much sense to go for another eightfold on more sophisticated hardware.

Join in and add SIMD to your code!

About the author

Michael Kopietz is a Render Architect in Crytek's research department. He leads a team of engineers working on the rendering of CryEngine(R) and also guides students during their theses. He has worked on, among other things, cross-platform rendering architectures, software rendering, and highly responsive servers, always with high performance and reusable code in mind. Before that, he worked on naval battle and soccer simulation games. Having started with assembler programming on the earliest home consoles, for him every cycle counts.

References

[1] Memory Management for Optimal Performance on Intel® Xeon Phi™ Coprocessor: Alignment and Prefetching https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and

[2] Rendering Worlds with Two Triangles, by Iñigo Quilez http://www.iquilezles.org/www/material/nvscene2008/nvscene2008.htm

Building Android* Middleware Libraries for x86 Devices Using the Android NDK

June 26, 2015, 5:08 pm

There are many middleware libraries that developers use to build great Android apps. An app may have been released some time ago in the Google* Play store, at a time when the library supported only ARM devices. Did you know you can reach a larger audience for your app by adding native x86 support? Building for x86 yields the best performance and experience on x86-based Android devices without breaking compatibility with ARM devices. Many of these libraries have been updated and now build for x86 by default. The article linked below explains how to do this in more detail.

https://software.intel.com/en-us/android/articles/using-the-android-x86-ndk-with-eclipse-and-porting-an-ndk-sample-app

Here is a list of some Android middleware libraries that support x86:

Cocos2d-x
OpenAL
GL2-android
MuPDF
Freetype
Vitamio
Marmalade
Ffmpeg
OpenSSL
CSipSimple
Opensl-soundpool
SDL
Unreal Engine
Xamarin
Unity
LibGDX
AndEngine
FMOD
GameMaker: Studio
OpenCV
Aviary
ZBar
Appcelerator Titanium
RenderScript
SQLCipher
aac-decoder
ZLib
GPUImage
SQLite3
MP3 LAME
MobileAppTracking Unity Plugin
libxmp
Immersion Haptic SDK
PDFViewer SDK
Android GifDrawable
Kamcord
Tesseract
Leptonica
libaal
BASS
Speex
NexPlayer SDK (NexStreaming)
Bangcle
Chipmunk
cURL
Gideros
JavaCV
Android Image Filter
Dropbox Sync
Intel TBB
OpenVPN
Metaio
RedLaser
Conceal
SyncNow

Are you using other libraries that have x86 support? Please comment and we'll add them to this growing list!

Diagnostic 13379: loop was not vectorized with "simd"

June 28, 2015, 9:23 pm

Product Version: Intel® Fortran Compiler 15.0 and above

Cause:

This diagnostic is emitted when a loop contains a conditional statement that controls the assignment of a scalar value AND the scalar value is referenced AFTER the loop exits. The vectorization report generated with the Intel® Fortran Compiler's optimization and vectorization report options then lists the loop as not vectorized:

Windows* OS: /O2 /Qopt-report:2 /Qopt-report-phase:vec

Linux OS or OS X: -O2 -qopt-report2 -qopt-report-phase=vec

Example:

The example below generates the following remark in the optimization report:

subroutine f13379( a, b, n )
implicit none
integer :: a(n), b(n), n

integer :: i, x=10

!dir$ simd
do i=1,n
  if( a(i) > 0 ) then
     x = i  !...here is the conditional assignment
  end if
  b(i) = x
end do
!... reference the scalar outside of the loop
write(*,*) "last value of x: ", x
end subroutine f13379

ifort -c /O2 /Qopt-report:2 /Qopt-report-phase:vec /Qopt-report-file:stdout f13379.f90

Begin optimization report for: F13379

Report from: Vector optimizations [vec]

LOOP BEGIN at f13379.f90(8,1)
....
remark #13379: loop was not vectorized with "simd"
LOOP END

Resolution:

The reference to the scalar after the loop requires that the value coming out of the loop is "correct", meaning that the loop iterations were executed strictly in order and sequentially. If the scalar is NOT referenced outside of the loop, the compiler can vectorize this loop, since the order in which the iterations are evaluated does not matter: without a reference outside the loop, the final value of the scalar is never used.

Example:

subroutine f13379( a, b, n )
implicit none
integer :: a(n), b(n), n

integer :: i, x=10

!dir$ simd
do i=1,n
  if( a(i) > 0 ) then
     x = i  !...here is the conditional assignment
  end if
  b(i) = x
end do
!... no reference to scalar X outside of the loop
!... removed the WRITE statement for X
end subroutine f13379

Begin optimization report for: F13379
Report from: Vector optimizations [vec]

LOOP BEGIN at f13379.f90(8,1)
f13379.f90(8,1):remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END

Intel Keynote and Intel technical presentations at Spark Summit West 2015

June 30, 2015, 12:43 pm

Accelerating Apache Spark-based Analytics on Intel Architecture

Michael A. Greene (Intel Software and Services Group)
To find new trends and strong patterns from large complex data sets, a strong analytics foundation is needed. Intel is working closely with Databricks, AMPLab, Spark community and its ecosystem to advance these analytics capabilities for Spark on Intel® architecture platforms and to accelerate the development of the Spark-based applications. Intel Architecture offers advanced silicon acceleration & built-in security technologies. By building on this trusted foundation and extending & optimizing the rich capabilities of Spark, we are accelerating the speed by which our customers derive real-time analytics insights and deliver meaningful solutions.
View Michael Greene’s keynote video.
Slides PDF

How to Boost 100x Performance for Real World Application w/ Apache Spark

As Apache Spark has taken off, many big data applications have moved to Spark in pursuit of a better user experience. However, initial performance doesn't always meet expectations. In this talk, we will share our experience working with several top internet companies in China to build their next-generation big data engines on Spark, including graph analysis, interactive and batch OLAP/BI, and real-time analytics. With careful tuning, Spark delivered a 5x-100x speedup over their original MapReduce implementations. We also gained experience in further improving the user experience while building real-world Spark applications in production environments. We expect this talk to be useful both for people who want to deploy their own Spark applications and for Spark developers interested in real-world challenges.
Slides PDF Video

Towards Benchmarking Modern Distributed Streaming Systems

We present a common benchmark for modern distributed stream computing systems. It helps characterize streaming systems such as Spark Streaming and Storm from performance, reliability, and availability perspectives. For example, Spark Streaming offers high throughput and better fault tolerance, while Storm responds quickly but has some shortcomings in complex computation cases. The benchmark can also serve as an integration test suite for evaluating different release candidates.

Slides PDF Video

Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing

Spark nodes are shifting from commodity hardware to more powerful systems with higher memory environments (200GB+). As an in-memory computing framework, popular wisdom has it that large Java heaps result in long garbage collection pauses slowing down Spark’s overall throughput. Through several case studies using large Java heaps, we will show it is possible to maintain low GC pauses for better application throughput. In this presentation, we introduce the Hotspot G1 collector as the best GC for Spark solutions running in large memory environments. We first discuss Hotspot G1 internal operations and several tuning flags. Those flags can be used to set desired GC pause target, change adaptive GC thresholds, and adjust GC activities at runtime. We will provide several case studies from Spark graph computing application running 80GB+ heap to show how we can tune those flags to remove unpredicted and protracted GC pauses for better application throughput.
Slides PDF Video

SparkR: The Past, the Present and the Future

The SparkR project provides language bindings and runtime support to enable users to run scalable computation from R using Apache Spark. SparkR has an active set of contributors from many companies and a number of recent developments have improved performance and usability. Some of the improvements include:
a new R to JVM bridge that enables easy deployment to YARN clusters,
serialization-deserialization routines that enable integration with other Spark components like ML Pipelines,
complete RDD API with support coming for DataFrames and
performance improvements for various operations including shuffles.
This talk will present an overview of the project, outline some of the technical contributions and discuss new features we will build over the next year. We will also present a demo showcasing how SparkR can be used to seamlessly process large datasets on a cluster directly from the R console.
Slides PDF Video

↧

Building Cross-OS Mobile Applications with Intel® INDE

July 1, 2015, 10:00 am

If you want to get started building cross-OS mobile applications with features such as video processing, transcoding, and special effects, the best place to start is Media for Mobile, which is part of the INDE suite. Why? Because it is available for download with the INDE Starter (free) edition here: https://software.intel.com/en-us/intel-inde. It includes many samples that you can build into full-featured applications or incorporate and deploy into your own applications with minimal coding effort. It also includes a set of easy-to-use components and APIs for a wide range of media scenarios, contains several complete pipelines for the most popular use cases, and lets you add user-developed components to those pipelines. Media for Mobile provides a different set of APIs for each of the three operating systems (Android*, iOS*, and Windows*) and supports applications running on both Intel® and ARM* devices.

Building an Android* Application

Building an iOS* application

Building a Windows* application

First, let's start with building and running Android mobile applications. You can start by downloading and installing Intel® INDE Media for Mobile and samples here. Now, let's run the Media for Mobile sample application for Android* on your device right from the Eclipse* IDE. Make sure you have the ADB driver for your device installed and USB debugging turned on on your device. In the Eclipse IDE, add the DDMS perspective to manage the device, and add the Console and Logcat views to your Java perspective, as these will help you monitor the status of the application while it runs on your device.

Now select the SamplesMainActivity project, go to the Run menu, and select Run. The following is a snapshot of what you will see when you run SampleMainActivity on your device.

All the features are included in a single Android application with several screens, with its activities mapped in the source code. When you run the application, you see a menu of features; most of them have an intuitive interface and are easy to run.

Video transcoding with Media for Mobile

For example, let's run Transcode Video in the application. Click on the "folder" button and select a video file to transcode. Click on transcode, select the required frame size and bitrate, and click on Start. You can either wait for the whole file to finish transcoding or click on stop. Click on OK under the message "Transcoding finished". Now click on Play to play back the transcoded video.

Streaming with Media for Mobile

Similarly, the streaming features (camera streaming, game streaming, or media file streaming) require setting up your own streaming server (a bit complicated, but who doesn't like a challenge?) using Wowza* Streaming Engine software. Get started by configuring your own server; once that is complete, you can run any of the streaming features above. In the application, configure the following parameters: Host, Port, Application Name, and Stream Name. The stream name must be in the following format: “mp4:yourStreamName”.

Second, let's build and run iOS* mobile applications. You can start by downloading and installing Intel® INDE Media for Mobile and samples here. Now, in Xcode, run Product->Build and the application should build successfully. You can see the resulting package in the project subfolder Products.

Video effects with Media for Mobile

Now run Product->Run and you can see the application running on the selected iOS* device or simulator. Select file, then select any video file and apply filters to change the effects on your video file.

Third, let's start with building and running Windows* mobile applications. You can start by downloading and installing Intel® INDE Media for Mobile and samples here. When you run the application on your local machine you will see Video processing and capturing features.

Video Transcode and playback with Media for Mobile

For example, let's run Transcode Video in the application. Click on pick video and select a video file to transcode, then click on Options and select an appropriate output video resolution, frame rate, and bitrate, and click on the transcode button to start transcoding. After transcoding is finished, you can watch the output video. The output streams are saved in the C:\Users\<username>\Videos folder.

Now, to run the Video Recording feature, allow the Media for Mobile application to use your webcam and microphone and start playing with this feature.

Now not only do I have a mobile application running on Android*, iOS*, and Windows*, I can also start developing and enabling other features in my application. For more info, please see the links below.

More Intel® INDE Media for Mobile links you may be interested in:

Ask questions and connect with Intel® INDE experts and fellow INDE developers at the INDE Forum, or on StackOverflow using the keywords “Intel INDE”.

↧

Removing CPU-GPU sync stalls in Galactic Civilizations* 3

July 2, 2015, 11:36 am

Galactic Civilizations* 3 (GC3) is a turn-based 4X strategy game developed and published by Stardock Entertainment that released on May 14th, 2015. During the early access and beta periods, we profiled and analyzed the rendering performance of the game. One of the big performance improvements made was the removal of several CPU-GPU sync stalls that were responsible for losing some parallelism between the CPU and GPU. This article describes the issue and the fix and emphasizes the importance of using performance analysis tools during development, while keeping their strengths and limitations in mind.

Spotting the issue

We started the rendering performance analysis with Intel® INDE Graphics Performance Analyzers (GPA) Platform Analyzer. The screenshot below is a trace capture from the game (without v-sync) before improvements were made. The GPU queue has several gaps within and between frames, with less than one frame's worth of work queued up at any time. If the GPU queue isn't fed well by the CPU and has gaps, the application will never leverage that idle time to improve performance or visual fidelity.

Before: Frame time = ~21 ms – Less than 1 frame queued – Gaps in the GPU queue – Very long Map call

GPA Platform Analyzer also shows the time spent processing each Direct3D* 11 API call (i.e., application -> runtime -> driver and back). In the screenshot above, you can see an ID3D11DeviceContext::Map call that takes ~15 ms to return, during which the application's main thread does nothing.

The image below shows a zoom into one frame’s timeline, from CPU start to GPU end. The gaps are shown in pink boxes, amounting to ~3.5 ms per frame. Platform Analyzer also tells us the cumulative duration of various API calls for the trace, with Map taking 4.015 seconds out of the total 4.306 seconds!

It’s important to note that Frame Analyzer cannot spot the long Map call with a frame capture. Frame Analyzer uses GPU timer queries to measure the time for an erg, which consists of state changes, binding resources, and the draw. The Map however happens on the CPU, with the GPU unaware of it.

Debugging the issue

(See the Direct3D resources section at the end for a primer on using and updating resources.)

Driver debug revealed the long Map call to be using D3D11_MAP_WRITE_DISCARD (Platform Analyzer doesn't show you the arguments of the Map call) to update a large vertex buffer that was created with the D3D11_USAGE_DYNAMIC flag.

This is a very common scenario in games to optimize the data flow to frequently updated resources. When mapping a dynamic resource with D3D11_MAP_WRITE_DISCARD, an alias is allocated from the resource's alias-heap and returned. An alias refers to the memory allocation for the resource each time it is mapped. When there is no room for aliases on the resource's current alias-heap, a new shadow alias-heap is allocated. This continues to happen until the resource’s heap limit is reached.

This was precisely the issue in GC3. Each time this happened (which was multiple times a frame for a few large resources that were mapped several times), the driver waited on a draw call using an alias of the resource (which was allocated earlier) to finish, so it could reuse it for the current request. This wasn’t an Intel-specific issue. It occurred on NVIDIA's driver too and was verified with GPUView to confirm what we found with Platform Analyzer.

The vertex buffer was ~560 KB (size was found via the driver) and was mapped ~50 times with discard in a frame. The Intel driver allocates multiple heaps on demand (each being 1 MB) per resource to store its aliases. Aliases are allocated from a heap until they no longer can be, after which another 1 MB shadow alias-heap is assigned to the resource and so on. In the long Map call's case, only one alias could fit in a heap; thus, each time Map was called on the resource, a new shadow heap was created for that alias until the resource's heap limit was reached. This happened every frame (which is why you see the same pattern repeat), wherein the driver was waiting for an earlier draw call (from the same frame) to be done using its alias, in order to reuse it.

We looked at the API log in Frame Analyzer to filter resources that were mapped several times. We found several such cases, with the UI system being the lead culprit, mapping a vertex buffer 50+ times. Driver debug showed that each map updated only a small chunk of the buffer.

Same resource (handle 2322) being mapped several times in a frame

Fixing the issue

At Stardock, we instrumented all their rendering systems to get additional markers into the Platform Analyzer’s timeline view, in part to verify that the UI system was behind the large call and for future profiling.

We had several options for fixing the issue:

Set the Map flag to D3D11_MAP_WRITE_NO_OVERWRITE instead of D3D11_MAP_WRITE_DISCARD:
The large vertex buffer was being shared by several like-entities. For example, most of the UI elements on the screen shared a large buffer. Each Map call updated only a small independent portion of the buffer. The ships and asteroids that used instancing also shared a large vertex/instance data buffer. D3D11_MAP_WRITE_NO_OVERWRITE would be the ideal choice here since the application guarantees that it won't overwrite regions of the buffer that could be in use by the GPU.
Split the large vertex buffer into several smaller ones:
Since alias allocation was the reason behind the stall, considerably reducing the vertex buffer size allows several aliases to fit in a heap. GC3 doesn't submit too many draw calls, and hence, reducing the size by a factor of 10 or 100 (560 KB to 5-50 KB) would fix it.
Use the D3D11_MAP_FLAG_DO_NOT_WAIT flag:
You can use this flag to detect when the GPU is busy using the resource and do other work before remapping the resource. While this lets the CPU do actual work, it'd make for a really bad fix in this case.
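
To make the first option more concrete, here is a minimal sketch of the bookkeeping an application could use to guarantee that writes into a shared dynamic buffer never overlap. It is plain C++ with no Direct3D calls; MapMode, MapRequest, and SharedBufferCursor are hypothetical names introduced for illustration and are not taken from GC3 or from the Direct3D API. Under these assumptions, each update gets a fresh region with no-overwrite semantics, and a discard is requested only when the buffer is full.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the two Map modes discussed above. A real
// renderer would pass D3D11_MAP_WRITE_NO_OVERWRITE or
// D3D11_MAP_WRITE_DISCARD to Map instead.
enum class MapMode { NoOverwrite, Discard };

// Where to write, and which mode to map with.
struct MapRequest {
    MapMode mode;
    std::size_t offset;
};

// Hands out non-overlapping regions of one shared buffer.
class SharedBufferCursor {
public:
    explicit SharedBufferCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes) {}

    // Reserve `bytes` for one update. Regions handed out since the last
    // discard never overlap, so earlier ones may still be read by the GPU.
    MapRequest reserve(std::size_t bytes) {
        assert(bytes <= capacity_);
        if (next_ + bytes > capacity_) {
            // Buffer is full: start over at offset 0 with a discard.
            next_ = bytes;
            return {MapMode::Discard, 0};
        }
        const MapRequest request{MapMode::NoOverwrite, next_};
        next_ += bytes;
        return request;
    }

private:
    std::size_t capacity_;
    std::size_t next_ = 0;
};
```

With a 100-byte buffer and 40-byte updates, the first two reservations land at offsets 0 and 40 in no-overwrite mode, and the third wraps to offset 0 with a discard. Only that third update can trigger a new alias allocation.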

We went with the second option and changed the constant used in the buffer creation logic. The vertex buffer sizes were hardcoded for each subsystem and just needed to be lowered. Several aliases could now fit into each 1 MB heap, and with the comparatively low number of draw calls in GC3, the issue wouldn’t crop up.
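
The arithmetic behind this fix can be sketched as follows. This is a simplified model that assumes the 1 MB (1024 KB) heap size described above and ignores any alignment or per-alias overhead the driver may add. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Simplified model: how many aliases of a buffer fit in one alias-heap.
// Alignment and driver bookkeeping are ignored (an assumption).
constexpr std::size_t aliasesPerHeap(std::size_t heapKB, std::size_t bufferKB) {
    return heapKB / bufferKB;
}

// Heaps needed so that each of `maps` discard-maps gets its own alias.
inline std::size_t heapsNeeded(std::size_t heapKB, std::size_t bufferKB,
                               std::size_t maps) {
    const std::size_t perHeap = aliasesPerHeap(heapKB, bufferKB);
    assert(perHeap > 0);  // the buffer must fit in a single heap
    return (maps + perHeap - 1) / perHeap;  // ceiling division
}
```

In this model a 560 KB buffer fits one alias per heap, so 50 discard-maps in a frame would need 50 heaps. At 50 KB, 20 aliases fit per heap and the same 50 maps need 3 heaps; at 5 KB, 204 aliases fit and a single heap is enough.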

Each rendering subsystem fix magnified the issue in another one, so we fixed it for all the rendering subsystems. A trace capture with the fixes and instrumentation, followed by a zoomed-in look at one frame, is shown below:

After: Frame time = ~16 ms – 3 frames queued – No gaps in GPU queue – No large Map calls

The total time taken by Map went down from 4 seconds to 157 milliseconds! The gaps in the GPU queue disappeared. The game had 3 frames queued up at all times and was waiting on the GPU to finish frames to submit the next one! The GPU was always busy after a few simple changes. Frame time dropped by ~24%, from ~21 ms to ~16 ms per frame.

Importance of GPU profiling tools during game development

Here’s what Stardock had to say:

Without tools like GPA Platform Analyzer or GPUView, we wouldn't have known what was happening on the GPU because the information we get back from DirectX is only if the call succeeded or not. Traditionally, we would have disabled systems, or parts of systems, to try to isolate where the performance costs are coming from. This is a very time consuming process, which can often consume hours or days without any practical benefit, especially, if the bottlenecks aren’t in the systems you expect.

Also, measuring isolated systems can often miss issues that require multiple systems to interact to cause the problem. For example, if you have a bottleneck in the animation system you may not be able to identify it if you have enough other systems disabled that the animation system (which is your performance problem) now has enough resources to run smoothly. Then you spend time troubleshooting the wrong system, the one you removed, instead of the source of the actual problem.

We have also tried to build profiling tools into our games. Although this works, we only get measurement data on the systems we explicitly measure, again making us unable to see issues from systems we wouldn’t expect. It is also a lot of work to implement and has to be maintained through the game's development to be usable. And we need to do it over again with each game we make. So we get partial information at a high development cost. Because of this, issues can be hard to detect just by looking over the code, or even stepping through it, because it may appear correct and render properly, but, in reality, it is causing the GPU to wait or perform extra work.

This is why it is important to understand what is happening on the GPU. GPU profiling tools are critical for quickly showing developers where their code is causing the GPU to stall or where the frame is spending the most time. Developers can then identify which areas of the code would benefit the most from optimization, so they can focus on making improvements that make the most noticeable changes to performance.

Conclusion

Optimizing the rendering performance of a game is a complex beast. Frame and Trace capture-replay tools provide different and important views into a game’s performance. This article focused on CPU-GPU synchronization stalls that required a trace tool like GPA Platform Analyzer or GPUView to locate.

Credits

Thanks to Derek Paxton (Vice President) and Jesse Brindle (Lead Graphics Developer) at Stardock Entertainment for the great partnership and incorporating these changes into Galactic Civilizations 3.

Special thanks to Robert Blake Taylor for driver debug, Roman Borisov and Jeffrey Freeman for GPA guidance, and Axel Mamode and Jeff Laflam at Intel for reviewing this article.

About the author

Raja Bala is an application engineer in the game developer relations group at Intel. He enjoys dissecting the rendering process in games and finding ways to make it faster and is a huge Dota2* and Valve fanboy.

Direct3D* resources primer

The Direct3D API can be broken down into resource creation/destruction, setting render pipeline state, binding resources to the pipeline, and updating certain resources. Most of the resource creation happens during the level/scene load.

A typical game frame consists of binding various resources to the pipeline, setting the pipeline state, updating resources on the CPU (constant buffers, vertex/index buffers,…) based on simulation state, and updating resources on the GPU (render targets, uavs,…) via draws, dispatches, and clears.

During resource creation, the D3D11_USAGE enum is used to mark the resource as requiring:

(a) GPU read-write access (DEFAULT - for render targets, uavs, infrequently updated constant buffers)
(b) GPU read-only access (IMMUTABLE - for textures)
(c) CPU write + GPU read (DYNAMIC - for buffers that need to be updated frequently)
(d) CPU access but allowing the GPU to copy data to it (STAGING)

Note that the resource's D3D11_CPU_ACCESS_FLAG needs to also be set correctly to comply with the usage for c & d.

In terms of actually updating a resource's data, the Direct3D 11 API provides three options, each of which is used for a specific usage (as described earlier):

(i) Map/Unmap
(ii) UpdateSubresource
(iii) CopyResource / CopySubresourceRegion

One interesting scenario, where implicit synchronization is required, is when the CPU has write access and the GPU has read access to the resource. This scenario often comes up during a frame. Updating the view/model/projection matrix (stored in a constant buffer) and the (animated) bone transforms of a model are examples. Waiting for the GPU to finish using the resource would be too expensive. Creating several independent resources (resource copies) to handle it would be tedious for the application programmer. As a result, Direct3D (9 to 11) pushes this onto the driver via the D3D11_MAP_WRITE_DISCARD Map flag. Each time the resource is mapped with this flag, the driver creates a new memory region for the resource and lets the CPU update that instead. Thus, multiple draw calls that update the resource end up working on separate aliases of the resource, which, of course, eats up GPU memory.

For more info on resource management in Direct3D, check:

John McDonald's "Efficient Buffer Management" presentation at GDC
Direct3D 11 Introduction to resources
Direct3D 10 Choosing a resource
UpdateSubresource v/s Map


↧

Intel® RealSense™ Technology and the Point Cloud

July 8, 2015, 3:27 pm

1. Introduction

Anyone who develops graphics applications will at some point come across the term “point cloud,” which, in 3D programming, simply refers to a set of vectors or points that represent a shape. In traditional 3D rendering, points alone are not enough to give a visual representation of the shape, because each one represents a single coordinate in space, not a volume or any association with surrounding points that could imply a surface. It is usually up to the programmer to join these points into polygons, or to use other techniques that define surfaces, to obtain a solid representation of the shape in question.
Figure 1. A point cloud representing a torus.

There is plenty of information available on capturing, manipulating, and rendering point cloud datasets, but very little advice on how to apply the concept when building Intel® RealSense™ applications.

This article offers recommendations on APIs, basic techniques, and technology you can investigate to add a few more tools to your kit. A basic knowledge of the Intel® RealSense™ SDK, 3D programming, and geometric structures is helpful, but not essential, for following the content.

2. Why It Matters

The raw data that comes from typical depth cameras is better thought of as a point cloud aligned to a common grid than as a three-dimensional figure. This subtle distinction is key to finding innovative solutions to the challenges we face today.

It is also worth saying that the problem of accurately manipulating virtual 3D spaces with nothing but your hands, such as grabbing a virtual ball out of the air or sculpting a clay statue, has yet to be solved. Activities like these seem a natural fit for Intel RealSense technology, but creating the techniques that would enable them goes beyond what most software development kits (SDKs) can do, leaving it to programmers to innovate and find solutions.

Beyond the collision possibilities mentioned above, another important reason to treat raw depth data as a point cloud is that it lets us combine that data into more accurate 3D representations of three-dimensional shapes. For example, you can scan a room from different perspectives and angles, collect the data as points, and then detect points in common to stitch the data together.

If you are still not convinced that point clouds are a powerful medium to work in, I invite you to search online for the videos “The Shipping Galleries - A 3D Point Cloud Fly Through” and “Real-time Rendering of Massive Unstructured Raw Point Clouds” and watch how the real world can be virtualized.

Now imagine a technology that uses point cloud data generated in real time, rather than the traditional approach of 100-million-point datasets. Imagine having realistic real-time characterizations in our virtual world, controlling virtual objects from the real world, and devising solutions never conceived before.

Figure 2. PerceptuCam is an Intel® RealSense™ conferencing application that uses point data to create a virtual version of the user.

Starting from the basics and understanding everything there is to know about point clouds can be very valuable for future projects. How long before we see Google vans capturing point clouds in the streets in real time and streaming them to the cloud for instant consumption by millions of users on the move? How long before every security camera in the big cities integrates depth-scanning capture equipment and hosts petabytes of point cloud data in free-to-use server farms? Point clouds are not going to stay as they are, and Intel RealSense technology is the gateway to working with them in real time before they become a mainstream consumer resource.

3. The Starting Point

In an earlier article I looked in depth at capturing, storing, and using 3D data from a depth camera, from the perspective of generating three-dimensional geometry. The article is called “3D from Depth Data” (https://software.intel.com/en-us/articles/perpetual-computing-generating-3d-from-depth-data).

Figure 3. Early-stage prototype showing how raw depth data can be joined to create 3D geometry.

The only difference between the 3D created in the earlier article and obtaining the point cloud now is that there is no joining step. Once the depth distance (Z) has been determined on the fixed scan grid (XY), the array in which these vectors are stored becomes our point cloud dataset, as shown in code samples 1 and 2 below.

CODE SAMPLE 1. Creating the point cloud data structure

// basic vector structure and point cloud dataset array
struct vec3
{
	float x;
	float y;
	float z;
};
vec3* dataset = new vec3[depthwidth*depthheight];
// for each grid position (x, y) with measured depth depthdistance:
dataset[(y*depthwidth)+x].x=(float)x;
dataset[(y*depthwidth)+x].y=(float)y;
dataset[(y*depthwidth)+x].z=(float)depthdistance;

To recap: once the depth camera is initialized and a depth data stream obtained, we can fill a 2D array of short (16-bit) values containing the distance from the camera to a detected solid object. The size of the 2D array reflects the resolution of the depth data format chosen. At the time of writing, several cameras offer depth resolutions from 320 x 240 to 640 x 480, producing a cloud of between 76,800 and 307,200 points. Since it is reasonable to assume that each point takes 12 bytes (4 bytes per vertex axis), we are talking about more than 900 kB to store a single uncompressed point cloud dataset, even at the lowest of these resolutions.
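
As a quick check of these figures, the following sketch computes the point count and uncompressed size for a given depth resolution. Point3, pointCount, and datasetBytes are hypothetical names used only for this illustration, and the sketch assumes 4-byte floats with no padding in the structure.

```cpp
#include <cassert>
#include <cstddef>

// Same layout as the vec3 structure in code sample 1.
struct Point3 {
    float x;
    float y;
    float z;
};

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");
static_assert(sizeof(Point3) == 12, "sketch assumes no padding in Point3");

// Number of points in a depth frame of the given resolution.
constexpr std::size_t pointCount(std::size_t width, std::size_t height) {
    return width * height;
}

// Bytes needed for one uncompressed point cloud of that resolution.
constexpr std::size_t datasetBytes(std::size_t width, std::size_t height) {
    return pointCount(width, height) * sizeof(Point3);
}
```

For 320 x 240 this gives 76,800 points and 921,600 bytes. For 640 x 480 it gives 307,200 points and 3,686,400 bytes, or roughly 3.5 MB per frame.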

4. A Little Color

One aspect of the camera data we have not discussed is that the additional color stream is often a higher-resolution image. The Intel RealSense SDK provides a mapping stream with a lookup to correlate each depth point with the color at the corresponding location.

CODE SAMPLE 2. Data structure extended with the RGB component

// basic vector structure with color, and point cloud dataset array
struct vec3
{
	float x;
	float y;
	float z;
	unsigned char red;
	unsigned char green;
	unsigned char blue;
};
vec3* dataset = new vec3[depthwidth*depthheight];
int datasetindex = (y*depthwidth)+x;
dataset[datasetindex].x=(float)x;
dataset[datasetindex].y=(float)y;
dataset[datasetindex].z=(float)depthdistance;
// colorStreamPtr is assumed to point at the 32-bit color pixel mapped to this depth sample
dataset[datasetindex].red=((*(DWORD*)colorStreamPtr)&0xFF0000)>>16;
dataset[datasetindex].green=((*(DWORD*)colorStreamPtr)&0xFF00)>>8;
dataset[datasetindex].blue=(*(DWORD*)colorStreamPtr)&0xFF;

Extending the point data structure with an RGB component makes it possible to reconstruct not only the shape but also the texture of the object. Only the most expensive “LiDAR” hardware can capture color and super-precise depth information at the same time. It is therefore well worth taking advantage of this information from the consumer device if you want to create the best visual representation of what is in front of the camera.
Figure 4. A snapshot of the color stream with the most distant depth pixels removed from the rendering.

The extra color component adds 24 bits (3 bytes) to each point, and alignment padding typically rounds the structure up from 12 to 16 bytes, which increases the memory footprint of the dataset and the cost of any subsequent transport if you intend to store these point cloud packets. One way to reduce this overhead is simple RGB compression, whether a 565 format (5 bits for red, 6 bits for green, 5 bits for blue) or something more daring, such as a palette and a lookup index.
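
A minimal sketch of the 565 packing mentioned above follows. It stores a color in 16 bits instead of 24 by dropping the low bits of each channel. The function names are hypothetical and are not part of the Intel RealSense SDK.

```cpp
#include <cassert>
#include <cstdint>

// Pack 8-bit red, green and blue into 16 bits laid out as RRRRRGGG GGGBBBBB.
// The low 3 bits of red and blue and the low 2 bits of green are discarded.
constexpr std::uint16_t packRGB565(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Recover approximate 8-bit channels from a packed value.
constexpr std::uint8_t red565(std::uint16_t c) {
    return static_cast<std::uint8_t>(((c >> 11) & 0x1F) << 3);
}
constexpr std::uint8_t green565(std::uint16_t c) {
    return static_cast<std::uint8_t>(((c >> 5) & 0x3F) << 2);
}
constexpr std::uint8_t blue565(std::uint16_t c) {
    return static_cast<std::uint8_t>((c & 0x1F) << 3);
}
```

The conversion is lossy: a full-intensity red of 255 comes back as 248 after a round trip, which is usually acceptable for visualization but worth keeping in mind if the colors are processed further.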

5. Uses of the Point Cloud Dataset

Let's assume we have stored the dataset as a typical vector array and are ready to visualize or control something. There are several techniques available, and this article covers a few.

Point cloud API

To get started, there is a well-known open source project called PCL (Point Cloud Library; http://pointclouds.org/). It contains numerous common point cloud operations grouped by area of interest, including filters, keypoint detection, tree generation for sorting, segmentation, surface detection, 3D shape recognition, and several visualization techniques.
Figure 5. Part of what Point Cloud Library offers (Copyright © PCL¹)

The specifics of these specialized modules are beyond the scope of this article, but with a little patience and perseverance you will get there. There is no shortage of advice and help for this invaluable API: a great many people contribute to it.

Physical manipulation

Many Intel RealSense applications settle for extracting something like a mouse pointer coordinate from the position of the hand or face and ask nothing more of the depth data. With the power of point clouds, the hand can become a real physical object, giving the application the same degree of virtual control one would have in the real world.

The technique involves creating some 76,000 physics spheres (320 x 240 depth data) and moving them to the real-time positions of the points in the dataset, eliminating all high-energy movements and collisions as part of the process. The result is an accurate physical surface of the visible hand, able to interact with other physical objects in the virtual world: lifting, pushing, grabbing, hitting, and touching in an entirely new control system.

Una manera de regular el tamaño del conjunto de datos es mediante el control de las muestras de datos de profundidad, para equilibrar la resolución de la mano 3D con el costo general de procesamiento. Quienes estén familiarizados con las técnicas para física de las GPU modernas pueden incluso transmitir todo el conjunto de datos a la memoria de vídeo y tener un nivel de granularidad sustancialmente mayor para la simulación.
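
To make the sampling trade-off concrete, here is a minimal sketch that keeps every `step`-th depth sample in each direction. The function name `subsample_depth`, the row-major buffer layout, and the convention that a depth of zero means "no reading" are all assumptions made for illustration, not details taken from the Intel RealSense SDK.

```python
def subsample_depth(depth, width, height, step):
    """Return (x, y, depth) tuples for every `step`-th sample of a depth map.

    `depth` is assumed to be a flat, row-major sequence of length
    width * height. Samples with a depth of zero are treated as
    "no reading" and skipped (an assumption for this sketch).
    """
    points = []
    for y in range(0, height, step):
        for x in range(0, width, step):
            d = depth[y * width + x]
            if d > 0:
                points.append((x, y, d))
    return points
```

With a 320 x 240 depth map, a step of 1 yields up to 76,800 points, a step of 2 yields up to 19,200, and a step of 4 yields up to 4,800, so the sphere count drops with the square of the step.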

Visual Representation

Rendering raw point cloud data looks like a series of tiny dots on a relatively empty screen, something like a ghostly 3D sketch. This would not be a desirable presentation method for most applications, but there are several ways to turn the swarm of points into something solid.
Figure 6. Even with a high concentration of points, the render is still hard to make out (© PCL¹).

The first technique was covered in an earlier article titled “3D from Depth Data.” In essence, you create a polygon from the three closest points and work your way across the mesh in the same manner until all the points are joined. This technique has several advantages and disadvantages. The main disadvantage is that it cannot tell which shapes should be kept separate. For example, the hand is separate from the head, but the basic joining method does not notice this and assumes the surface is single and unbroken. Another disadvantage is the processing cost of joining so many points into polygons, a step that must run every cycle and that consumes a lot of CPU and GPU resources. The final disadvantage is that the application needs more memory to hold the generated 3D mesh, because its final size is larger than that of the original dataset (which must also stay in memory). The number one advantage is that once the 3D mesh has been generated, it has all the benefits of a conventional geometric object and can be lit, textured, and shaded as the application requires.
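
The joining step can be sketched for the simple case where the points still sit on the depth camera's regular grid, so the "three closest points" are just neighbors in a 2 x 2 cell. This is an illustrative sketch, not the method from the earlier article. The function name `triangulate_grid` is hypothetical, and the optional `max_gap` threshold is an added refinement (not described in the text) showing one way to avoid bridging the hand and the head.

```python
def triangulate_grid(depth, width, height, max_gap=None):
    """Join neighboring grid samples into triangles.

    Returns a list of index triples into the flat, row-major `depth`
    buffer. If `max_gap` is given, any triangle whose corner depths
    differ by more than `max_gap` is skipped, which keeps separate
    shapes (for example, hand and head) from being stitched together.
    """
    triangles = []
    for y in range(height - 1):
        for x in range(width - 1):
            a = y * width + x   # top-left corner of the cell
            b = a + 1           # top-right
            c = a + width       # bottom-left
            d = c + 1           # bottom-right
            for tri in ((a, b, c), (b, d, c)):
                zs = [depth[i] for i in tri]
                if max_gap is not None and max(zs) - min(zs) > max_gap:
                    continue    # depth discontinuity: leave a hole here
                triangles.append(tri)
    return triangles
```

Without `max_gap`, a 320 x 240 grid produces 2 x 319 x 239 triangles every frame, which gives a sense of the processing and memory cost described above.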

A more experimental technique is to sort the point cloud data into a search tree (PCL has more details on these techniques) and discard the data in real time after processing. Similar to how voxel technology works, the shape is rendered in screen space and each pixel triggers a read from the sorted search tree. If the position and angle are fixed, the lookup can be very fast. If you add a more dynamic ray-casting lookup, the point cloud shape can be rendered from any angle and position. In addition, because the lookup can return the nearest viable point for each screen pixel, the gaps that normally accompany raw point cloud renders are filled in. The paper titled “The Quantized kd-Tree” (http://research.edm.uhasselt.be/tmertens/papers/qkdtree.pdf) has more information on kd-trees and is a good starting point for further research.
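
For readers new to search trees, the sketch below builds a plain (not quantized) 3D kd-tree and looks up the stored point nearest to a query position. It is a teaching example under simple assumptions: points are `(x, y, z)` tuples, the tree is stored as nested dictionaries, and the names `build_kdtree`, `sq_dist`, and `nearest` are hypothetical. A production system would more likely rely on the tree implementations in PCL.

```python
def build_kdtree(points, depth=0):
    """Recursively split the points on x, y, z in turn at the median."""
    if not points:
        return None
    axis = depth % 3
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((a[i] - b[i]) ** 2 for i in range(3))


def nearest(node, target, best=None):
    """Return the stored point closest to `target`."""
    if node is None:
        return best
    if best is None or sq_dist(node["point"], target) < sq_dist(best, target):
        best = node["point"]
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Only search the far side if the splitting plane is closer
    # than the best point found so far.
    if diff * diff < sq_dist(best, target):
        best = nearest(far, target, best)
    return best
```

Building the tree is the expensive part, so it belongs in a preparation step. Each lookup then visits only a small part of the tree in the typical case, which is what makes a per-pixel read plausible.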

Automatic Point Cloud Mapping

Another interesting use of the point cloud is that it makes it possible to accurately detect a unique marker within a snapshot of a point cloud dataset and use that marker as an anchor point for joining a second point cloud. Running in real time with automatic marker detection, you can create an application that learns more about a shape the longer it stays in front of the camera. Take a cup as an example. A snapshot of the front shows at most 50% of the surface, with no point data for the back. The snapshot is passed to another thread, which starts identifying markers within the point cloud (handle, circular rim, indentations, planes, cylinders, and so on) and stores them for later use.
Figure 7. Point data can be reduced to simple planes and cylinders with ideal markers (© PCL¹).

With subsequent batches of data, this process is repeated and a second process starts matching markers. Only those that show a high correlation are instructed to join the real-time point cloud data. In theory, when you pick up an object and show the computer all of its sides, the software understands what the whole shape looks like.

3D Photocopier

Now that 3D printers are readily available, a virtual 3D mesh can be turned into a real-world object simply by printing it as a solid. Using the technique just described, almost any object can be scanned in seconds with the depth camera, inspected in software, and then converted to a format ready for the 3D printer. Imagine losing a piece from your favorite chess set: you take the equivalent piece of the other color and scan it while turning it in your hand.
Figure 8. Hold the object in front of you and let the computer take care of the rest.

To improve accuracy, the software will differentiate between the color of the hand, the fingers, the face, and the predominant tones of the background. When the application starts, there might be a short “calibration step” that asks you to “wave to the camera” as a kick-off.

In practice, this kind of free, uncontrolled scanning never produces results as accurate as those from professional labs or potter's-wheel-style turntables, but with point cloud filters (PCL has more information on delineation and noise-reduction filters) a good-quality sealed 3D mesh can be generated from a sufficient number of samples. This process relies heavily on the depth camera's ability to stream up to 60 frames per second when the color stream is turned off, which produces many mini-snapshots and more opportunities for an intelligent marker detection algorithm to do its work.

3D Shape Detection

Extending this technique, once we have a database of marker references and enough associated point clouds to reconstruct a 3D object, we have the essentials of a 3D object detection system. Software can be written to cache an entire shape, and the next time that shape is shown to the camera, dozens of flags will fire because the computer recognizes many of the markers it has on record for that object.
Figure 9. With enough markers, this cup can be detected from its handle and cylinder (© PCL¹).

Countless opportunities will open up for software engineers once we solve the problems of instant object detection. Today, computers can recognize a few words and gestures, like a well-trained but almost blind dog. With the ability to distinguish between different objects, and by using context to catalog those observations, the computer stops being a crude device that draws conclusions from generalizations and becomes an intelligent and very specific one, which opens the door to some very interesting projects.

Beyond detecting a cup, the software could detect facial features. The longer a person sits in front of the computer and the more markers are associated with that “object,” the faster the machine will be able to recognize the person.

6. Tips and Tricks

Do

Before putting any point cloud dataset technique into practice, implement a way to render the raw point cloud on screen. This will serve as a debug view during development and will also confirm that the camera is generating what it should.
If you choose to use the Point Cloud Library, first configure and build the provided examples and make sure the calls into the library work. PCL has many dependencies, so it may be easier to port your Intel RealSense application into an existing PCL example than the other way around.
Before setting out to write your own visual technique for going from point cloud to 3D, spend a few hours researching the very good techniques that already exist, such as Delaunay triangulation, the ball-pivoting algorithm, Poisson surface reconstruction, and alpha shapes, among many others.
MeshLab is another good open source tool that can be used to convert point clouds to 3D meshes, and it comes with very practical cleaning algorithms that help seal and refine the resulting meshes.

Don't

Do not try to process point cloud datasets in real time with code that builds many trees and is designed to make subsequent point accesses fast. This kind of code is meant for preparation steps, not for real-time processing. Where possible, test the generation technique separately in a prototype before relying on it in your main software.
Do not try to create point cloud data under a mutex lock, because it reduces the performance of the whole application whenever specific threads stall. It is better to create a chain of point cloud allocations that lets the depth stream generate them as fast as possible, and to use a second thread for any heavy manipulation of the dataset that needs to happen elsewhere.
Do not lose sight of the memory and storage space the application will start to demand. The vast majority of applications that work with point clouds consume enormous amounts of both memory and disk space, and this can get out of control very easily. Budget your resources in advance.
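
The advice about avoiding a long-held mutex can be sketched as a simple hand-off between two threads. In this hypothetical example, the capture side pushes a fresh copy of each frame onto a queue and moves on, while a second thread does the heavy work. The names `run_pipeline`, `frames`, and `process` are assumptions for illustration. Note that Python's `queue.Queue` does take a brief internal lock on each put and get; the point is that the capture loop never waits for the processing itself.

```python
import queue
import threading


def run_pipeline(frames, process):
    """Hand each captured point cloud to a worker thread for processing.

    `frames` is any iterable of point lists (standing in for the depth
    stream) and `process` is the heavy per-cloud function. Results are
    returned in capture order.
    """
    pending = queue.Queue()
    results = []

    def worker():
        while True:
            cloud = pending.get()
            if cloud is None:          # sentinel: the stream has ended
                break
            results.append(process(cloud))

    thread = threading.Thread(target=worker)
    thread.start()
    for frame in frames:
        # Enqueue a new allocation per frame so the capture side can
        # reuse its own buffer immediately.
        pending.put(list(frame))
    pending.put(None)
    thread.join()
    return results
```

An unbounded queue trades memory for throughput, so in a real application it would be worth capping the queue length or dropping stale frames, in line with the resource-budget advice above.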

7. Summary

Perhaps in the not-too-distant future we will walk into our office, bedroom, or study, and as we sit down we will be greeted with a friendly “Hello, Lee. Good to see you again.” And then, in a slightly irritated voice: “You are not wearing your glasses. Do you want to strain your eyes?” “Computer, I am wearing my glasses,” I reply. “Yes, I see them now. Sorry,” comes the response.

For many years I have been convinced that computers and robots can be trained to some degree. Today we “kill” our computing devices at the end of every working day. We switch them off, wipe their brains, and switch them back on the next day. We load and unload the many burdens we want our electronic donkey to carry, but the donkey itself acts mechanically, cannot remember who its owner is, and hardly cares. Could we not teach it to recognize us at a glance, to passively absorb and build a neural network of point cloud references, and to connect them to its other senses, such as time, location, and the task it is performing? It would then be simple to program the computer with some basic human idiosyncrasies, for example recognizing situations that repeat often: “Lee, you have been wearing the same shirt for seven days. Lucky I don't have a nose!” Easier still would be to replace conventional programming with direct visual communication: “No, computer, this is a tablet, not a phone.” “Show me the device properly!” the computer asks.

This may sound like science fiction or the ravings of a programmer. A few years ago I would have agreed. The difference today is that we have enough cloud storage capacity to record years of computing experience, both individually and collectively. We have sensors that let devices see and hear us, and the computing power to make all of this a reality. All that is missing right now is a pioneer, someone bold who, after reading this, does not exclaim “what nonsense!” or “it will never happen,” but instead thinks “I could give it a try.”

About the Author

When not writing articles, Lee Bamber is the CEO of The Game Creators (http://www.thegamecreators.com), a British company that specializes in the development and distribution of game creation tools. Established in 1999, the company and its surrounding community of game makers are responsible for many popular brands, including Dark Basic, The 3D Game Maker, FPS Creator, App Game Kit (AGK) and, most recently, Guru.

¹Special thanks to POINTCLOUDS.ORG for sharing the images from its website under Creative Commons Attribution 3.0. As required, we include the link to the license (http://creativecommons.org/licenses/by/3.0/) and confirm that we have made no changes to the images, which are the intellectual property of PCL.

Notices
Intel, the Intel logo, and Intel RealSense are trademarks of Intel Corporation in the U.S. and other countries.
Copyright © 2015 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.


↧

Manageability Commander Web Edition

July 10, 2015, 12:59 pm


Web applications have become very powerful, and when it comes to computer management, the industry is moving to the web. This makes sense: web applications are instantly deployed, cross-platform, and run under strict security rules. For years, I have been working on the MDTK and its most famous tool, the Manageability Commander. Today, we are releasing a first version of the Manageability Commander Web Edition, which is built completely in JavaScript. The goal is simple: make it possible for anyone to interact with Intel® Active Management Technology (Intel® AMT) using only code that runs in a web browser. Imagine going to a web site and being able to manage all of your small business or corporate computers.

To make this happen, we built a JavaScript WSMAN stack along with redirection protocol, remote desktop (KVM), and remote terminal libraries. We then used these libraries to write a fully web-based Intel AMT console. Commander Web Edition runs within a node-webkit (nw.js) frame as a standalone tool, but can also be adapted to run on web servers. It is an early version, but the most difficult parts are already present. The WSMAN stack allows us to interact with Intel AMT for configuration, power control, and much more. We then have remote desktop and terminal for live management of the remote machine.

Moving forward, there are many opportunities for Intel® AMT as we make web-based and cloud management a new option. We are looking for testing and feedback on this new software. If you are interested in adding Intel AMT capabilities to your own web applications, the source code includes samples that can get you started.

Downloads: http://opentools.homeip.net/open-manageability/web-management
Demonstration Video: https://www.youtube.com/watch?v=M22RQelBFA4
Presentation: http://info.meshcentral.com/downloads/mdtk/WebMDTK-Presentaion.pptx

Feedback appreciated,
Ylian Saint-Hilaire

The Manageability Commander Web Edition tool allows you to connect to and manage computers
that support Intel® Active Management Technology (Intel® AMT), all in JavaScript.

This is an early release, but the most complicated features are already present and working.
WSMAN, Hardware KVM, and Serial-over-LAN are all web based.

The web application is built using a set of new JavaScript libraries that communicate directly with
Intel AMT. There is no need for a server to do anything; the smarts are all in the web application.


↧

White Paper: Green Application and Service Development Guide

July 13, 2015, 7:24 pm


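Authors: Intel Corporation and China Mobile Communications Group

Main contributors: 黄超 (Huang Chao, Intel), Sabharwal, Manuj R (Intel), 方亮 (Fang Liang, Intel), 李可 (Li Ke, China Mobile), 李雯雯 (Li Wenwen, China Mobile), and others.

Abstract:
With the rapid worldwide growth of smartphones and mobile applications, power consumption has become a major obstacle to usability. User interaction with mobile applications consumes most of a device's battery power. In mobile devices, the display, application processing, and the communication modules consume the most power and rank as the top three. Designing mobile applications the right way can therefore significantly extend a device's battery life. In this paper, we explain the key factors that affect device power consumption and provide methods for reducing it, which helps improve application performance and the user experience. The techniques described here mainly target applications for Android.

Click the link below to download the article (PDF):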

↧

5 Traits of Successful Developers

July 14, 2015, 8:03 pm


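The Android app store holds more than 1.5 million apps. Competition is fierce, and it is very hard for a new developer's app to get noticed. However, more than 2 billion devices run Android, so this is a platform you definitely do not want to miss.

If you are just starting out, iHub developers offer a few tips to help your app stand out.

•   Be persistent. Successful developers invest considerable effort and time in their apps. They accept feedback and use constructive comments to build newer, better apps.
•   Focus on quality, not quantity. Many developers believe that the more apps they build, the more successful they will be, but this is not the case. Rather than spending a lot of time and effort on three ordinary apps, build one excellent app.
•   Show business sense. Most developers focus on the technical side of an app, but that is only 30% of the work. The other 70% is about how to make the app profitable. If it is a paid app, where is its target market, and how will target customers find it? If it is a free app, can it earn revenue through premium content or advertising?
•   Never stop learning. Be able to learn quickly. Successful developers usually have insight into the latest technology and market trends.
•   Research and understand your target market. If your app does not meet a market need, the market will abandon it. The best starting point is to make sure your app is entertaining, educational, or makes people's lives easier.

With the Intel Integrated Native Developer Experience (INDE) software, you can easily meet these basic requirements. It provides everything you need to create C++ and Java applications for Android, Windows, and OS X, including compiler technology, domain-specific libraries, performance analyzers, hardware accelerator emulators, and other great features. Best of all, you can download the Professional Edition free for a limited time. Click here to start developing your app now!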


Co_author:

Hai Shen (Intel)


↧

Use "column" option to display data on selective columns in the report of VTune(TM) Amplifier XE

July 15, 2015, 10:58 pm


Intel® VTune™ Amplifier XE 2015 can collect performance data from a running application. General-exploration is a good analysis type for capturing all the typical microarchitecture-related performance counters (from the hardware Performance Monitoring Unit in the processor), and the command-line tool (amplxe-cl) can display all of them. For example:

$ amplxe-cl -c general-exploration -- ./threadexamine 13

Usually we generate a report that records the performance counters for all events. However, some of those counters are zero, as the summary report shows:
$ amplxe-cl -R summary -r r019ge/
…
MACHINE_CLEARS.MASKMOV 0 0 100003
MEM_UOPS_RETIRED.SPLIT_STORES_PS 0 0 100003
MACHINE_CLEARS.MEMORY_ORDERING 0 0 100003
MACHINE_CLEARS.SMC 0 0 100003
PARTIAL_RAT_STALLS.FLAGS_MERGE_UOP_CYCLES 2820004230 282 2000003
PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW 0 0 2000003
UOPS_ISSUED.ANY 65840098760 6584 2000003
UOPS_RETIRED.RETIRE_SLOTS 58490087735 5849 2000003
INT_MISC.RECOVERY_CYCLES 490000735 49 2000003
CPU_CLK_UNHALTED.THREAD_P 24820037230 2482 2000003
ITLB_MISSES.STLB_HIT 0 0 100003

So there is no need to display every counter in the report. You can select only the events of interest (for example, the non-zero ones) when generating the report:
# amplxe-cl -R hw-events -column=UOPS_ISSUED.ANY,UOPS_RETIRED.RETIRE_SLOTS,INST_RETIRED.ANY -r r019ge
amplxe: Using result path `/home/peter/problem_report/r019ge'
amplxe: Executing actions 50 % Generating a report Column filter is ON.
Function      Module         Hardware Event Count:INST_RETIRED.ANY (M)  Hardware Event Count:UOPS_ISSUED.ANY (M)  Hardware Event Count:UOPS_RETIRED.RETIRE_SLOTS (M)
------------  -------------  -----------------------------------------  ----------------------------------------  --------------------------------------------------
test          threadexamine  26,568                                     31,210                                    27,400
solve         threadexamine  21,880                                     34,610                                    31,060
__do_softirq  vmlinux        20                                         10                                        30
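
Picking out the non-zero events by hand gets tedious when the summary is long. The following sketch shows one way to automate it. It assumes each event line of the summary has exactly four whitespace-separated fields (event name followed by three numbers, as in the listing earlier in this article) and that the first number is the event count; the helper names `nonzero_events` and `column_option` are hypothetical and are not part of VTune Amplifier.

```python
def nonzero_events(summary_text):
    """Return the names of events whose first count column is non-zero.

    Assumes event lines look like 'EVENT.NAME 123 1 2000003'; any line
    that does not have four fields with a numeric second field is ignored.
    """
    events = []
    for line in summary_text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue
        name, count = parts[0], parts[1]
        if count.isdigit() and int(count) > 0:
            events.append(name)
    return events


def column_option(events):
    """Build the -column=... argument from a list of event names."""
    return "-column=" + ",".join(events)
```

For example, you could save the summary output to a text file, pass its contents to `nonzero_events`, and paste the string returned by `column_option` into the `amplxe-cl -R hw-events` command line.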

For a hotspots report, the available columns are {CPU, Spin, Overhead}, and you can select one or more of them. For example:
$ amplxe-cl -R hotspots -r r019ge -column=CPU,Spin
amplxe: Using result path `/home/peter/problem_report/r019ge'
amplxe: Executing actions 50 % Generating a report Column filter is ON.
Function      Module         CPU Time  Spin Time
------------  -------------  --------  ---------
test          threadexamine  3.577s    0s
solve         threadexamine  3.536s    0s
__do_softirq  vmlinux        0.009s    0s


↧

Programming and Compiling for Intel® Many Integrated Core Architecture

July 16, 2015, 3:31 pm


Compiler Methodology for Intel® MIC Architecture

This article is part of the Intel® Modern Code Developer Community documentation which supports developers in leveraging application performance in code through a systematic step-by-step optimization framework methodology. This article addresses: parallelization.

This methodology enables you to determine your application's suitability for performance gains using Intel® Many Integrated Core Architecture (Intel® MIC Architecture). The following links will allow you to understand the programming environment and help you evaluate the suitability of your app to the Intel Xeon and MIC environment.

Preparing for the Intel® Many Integrated Core Architecture
- Application Analysis for Intel® MIC Architecture Suitability
- Expectations for User Source Code Changes

Because of the rich and varied programming environments provided by the Intel Xeon and Xeon Phi processors, the Intel compilers offer a wide variety of switches and options for controlling the executable code that they produce. This chapter provides the information necessary to ensure that a user gets the maximum benefit from the compilers.

New User Compiler Basic Usage

The Intel® MIC Architecture provides two principal programming models. The native model covers compiling applications to run directly on the coprocessor. The heterogeneous offload model covers running a main host program and offloading work to the coprocessor, and includes both standard offload and the Cilk_Offload model. The following chapter gives you insight into the applicability of these models to your application.

Efficient Parallelization

The third level of parallelism associated with code modernization is vectorization and SIMD instructions. The Intel compilers recognize a broad array of vector constructs and are capable of enabling significant performance boosts for both scalar and vector code. The following chapter provides detailed information on ways to maximize your vector performance.

Vectorization Essentials

The final chapter in the section provides insight into some advanced optimization topics. Included are discussions of floating point accuracy, data movement, thread scheduling, and many more. This is a good chapter for users who are still not seeing their desired performance or who are looking for the last level of performance enhancements.

Advanced Optimizations

Native and Offload Programming Models
- Building Native Applications for Intel® MIC Architecture
- The Heterogeneous Offload Programming Model
- Effective Use of Compiler Features for Offloading (Updated 08/2014!)
- OpenMP 4.0 combined offload constructs (New 08/2014!)
- Offload support for transferring arrays of pointers (New 08/2014!)
- Offload support for non-contiguous array slices (New 08/2014!)
- Using the Fortran 2008 BLOCK construct with the Intel® Xeon Phi™ coprocessor (New 08/2014!)
- Introduction to Asynchronous Offload (C++ and Fortran)
- Asynchronous Offload Examples (C++, Fortran)
- Cross-Compilation Challenges
- How to Achieve Peak Transfer Rate (C++, Fortran)
- Techniques to Reduce Offload-related Memory Allocation Overheads (C++, Fortran)
- Taking Advantage of Offload Pointer Association and alloc/into Keywords (C++, Fortran)


Last edited by:

AmandaS (Intel)

↧

A Structured Performance Optimization Framework for Simultaneous Heterogeneous Computing

July 17, 2015, 10:56 am


Heterogeneous computing platforms with a multicore host system and many-core accelerator devices have taken a major step forward in the mainstream HPC computing market this year with the announcement of the HP Apollo 6000 System’s ProLiant XL250a server with support for Intel® Xeon Phi™ coprocessors. Although many application developers attempt to use it in the same way as GPGPU acceleration platforms, doing so forfeits the processing capability of the multicore host processors and introduces power inefficiency in corporate IT operations. In this paper, we propose an application optimization framework that turns a sequential legacy application into a highly parallel application, making use of the hardware resources on both the host CPU and the accelerator devices to enable simultaneous heterogeneous computing. As a case study, we look at how to apply this framework and adopt a structured methodology to adapt a European option pricing application to take advantage of a heterogeneous computing environment.

Download the complete PDF

Download (295.72 KB)


↧

Next-gen applications showcased by Intel® RealSense™ App Challenge

July 20, 2015, 8:18 am


By Marc Saltzman

Imagine one day you’re being rehabilitated after an orthopedic procedure, perhaps to repair a repetitive strain injury in your wrist. Instead of wearing sensors to monitor your progress – which are cumbersome, expensive and limited – you simply move your fingers in front of what looks like a webcam and the physician or physiotherapist gleans a more accurate reading.

Or on a more recreational note, envision yourself sitting down to play a game in front of your laptop, desktop or tablet. No controller? No problem. With hands outstretched, you perform minute gestures in the air to manipulate onscreen content – be it dangling your fingers to make it rain, punching forward to break through rocks or extending a forefinger to draw an object in the virtual sand.

Both of these scenarios summarize the grand prize-winning entries in the Intel® RealSense™ App Challenge, the third annual call to developers to create next-generation experiences using an Intel® RealSense™ 3D Camera and the Intel® RealSense™ Software Development Kit (SDK) for Windows*.

“Orthosense” by David Schnare and the game “Seed” by Alexandre Ribeiro da Silva took the top spots out of thousands of entries from 37 countries, up from 19 last year. With the incentive of a $465,000 cash prize pool shared among 21 winners, developers were challenged to blur the lines between human and computer interaction with a camera similar to the one already embedded in many of today’s devices, including the HP Envy* laptop and Lenovo B50 All-in-One desktop.

Utilizing a best-in-class three-dimensional depth sensor, Intel® RealSense™ technology enables new ways to interact, including 22-point hand and finger tracking and gesture recognition, facial detection and tracking, speech recognition, and even background subtraction to create a kind of green screen – without needing a green screen.

Think of it as Kinect* on steroids.

“As rapidly and profoundly as technology continues to advance, one thing remains constant: the need for more human and intuitive ways to interact with it,” says Scott Steinberg, a leading analyst, futurist, and author of Make Change Work for You. “Whether scanning in favorite objects, like children’s toys, and reprinting them on demand at grandma’s house, navigating through 3D models of homes or tradeshow floors with the wave of a hand, or using gesture controls to flip through your music collection, it’s only natural for software and hardware developers alike to look to technology solutions such as this that provide more user-friendly and accessible controls.”

In the same way Microsoft's Kinect made it possible to engage with video games, films and TV shows with the flick of a wrist, Intel RealSense technology takes it to the next level for a variety of business, health, social, family and communications-related applications, adds Steinberg.

Dean Takahashi, lead writer for GamesBeat at VentureBeat, says he’s glad to see Intel “is seeding developers to make creative demos that take gaming in a new direction,” and the company is “walking a fine line between making a cool technology that has precise controls and delivering a solution that is affordable to everyone.”

Winners of the Intel RealSense App Challenge submitted entries in one of five categories: Collaboration, Open Innovation, Learning, Interact Naturally, and Gaming.

To learn more, please visit the Winner’s Showcase website.

A brief look at some highlights:

Orthosense

A collaborative effort between the UK and Canada, Orthosense uses Intel RealSense technology to identify, calculate and record hand and wrist range movements for orthopedic specialists and surgeons. The goal is to measure range of movement of patients as they rehabilitate from hand and wrist problems. With greater accuracy, comfort and speed, both patient and practitioner receive objective data to measure progress.

“Orthosense algorithms provided the positions of the joints and calculated the exact angle of any given joints at any given time,” explains Kinetisense Inc. chief officer David Schnare. “No need to use plain-old tools or expensive wearable equipment [as] the patient simply places their hand in front of the sensor and all the necessary calculations are performed with remarkable accuracy and speed.”

“The best part,” Schnare continues, “is the fact the range of motion identification is performed in less than half a second. The patient simply places their hand in front of the sensor and all the necessary calculations are performed with remarkable accuracy and speed.”

Schnare says the company’s goal is to take advantage of “the affordable Intel RealSense camera to produce affordable human movement analysis software that’s widely adopted by practitioners.”

Orthosense was awarded Grand Prize in the Open Innovation category.

Seed

Grand Prize winner in the Gaming category, Seed challenges players to help guide a floating seed through its journey to reforest a devastated land. Fitting, perhaps, as the developer is based in Brazil, which has one of the highest deforestation rates in the world.

Using intuitive gestures in front of the 3D camera, players control the environment to help the seed along, such as removing obstacles and making it rain.

“Intel RealSense technology gives a certain ‘magic’ feeling to the game since the player’s hand movements produce an instant response in the seed,” explains Alexandre Ribeiro, co-founder of AnimaGames. “Using the hand and fingers position detection system, we could develop the game’s core mechanic that works as a ‘guessing game,’ where the player has to think and perform a gesture that matches the required action.”

Virtual 3D Video Maker

First place winner in the Collaboration category, Virtual 3D Video Maker – as the name suggests – lets you record yourself as a 3D hologram for a more immersive communication experience. Along with playing it back in front of a number of digitally imported scenes of your choosing, you can also change the camera position over the course of the playback to add an extra dimension to your video blogs, messages or even real-time chats.

“I used both face direction and voice recognition to create a context-based, hands-free interface,” says Lee Bamber, CEO of The Game Creators in North Wales, UK. “I also used the real-time depth data from the camera to create a 3D representation of the user and store this data along with the real-time audio to create a true 3D recording that can be played back from different angles.”

About the Author

Marc Saltzman is one of North America's most recognizable and trusted tech experts, specializing in consumer electronics, business technology, interactive entertainment and Internet trends. Marc has authored 15 books since 1996 and currently contributes to nearly 50 high-profile publications in North America, including USA Today, MSN, Yahoo!, Costco Connection, Toronto Star, Movie Entertainment, TheLoop.ca, Telus Talks Business and Rogers Connected. Marc hosts various video segments, including "Gear Guide" (seen at Cineplex movie theaters and sister chains across Canada) and is a regular guest on CNN, CNN International and CTV's Canada AM. Marc also hosts "Tech Talk," a radio show on Montreal’s CJAD 800, part of Bell Media.

Follow Marc on Twitter: @marc_saltzman.

Find everything you need to know about how to develop Intel® RealSense™ applications at the Intel® Developer Zone.

↧

Take Part in the INOVApps 2015 Contest Promoted by the Ministry of Communications!

July 23, 2015, 10:33 am

Intel partners, below is another opportunity to present your solutions!

On July 14, the Ministry of Communications launched the second edition of the INOVApps Contest, which aims to support the development of public-interest applications for mobile devices and connected digital TVs.

The contest will select 100 new and original projects, with a prize of R$ 50,000 for each selected app. Apps may be developed for Android, iOS, Windows Phone, and the Ginga middleware.

Submitted project proposals must fall under one of the following themes:

Education/Teaching;
Health;
Urban Mobility;
Public Safety;
Accessibility/Human Rights;
Assessment of the quality of public services and policies;
Social Assistance;
Culture;
Consumer Rights and Protection;
Improvement of public-sector management;
Tourism and Major Events;
Processing of public policy indicators (open data);
Social Participation;
Work and Income; and
Environment.

Individuals, individual microentrepreneurs, and micro and small businesses are eligible to take part in the contest.

For more details, see the contest's official announcement at:
http://www.comunicacoes.gov.br/concurso-inovapps

↧

Technical Articles on trending topics of Intel® Media SDK

September 2, 2014, 7:09 pm

Legal Disclaimer

This page links to technical articles on important topics in Media SDK, a component of Intel® Media Server Studio and Intel® INDE. The topics reflect the questions and discussions that come up most often on the forum. This page will be updated periodically with more articles as they become available. We hope these articles help developers answer their questions on the major topics.

Framework to writing Media Applications

Media Processing Features

Video Processing (VPP)

Tools & more

Media SDK White papers

↧

Intel® INDE: A Tool for Game Developers Using Commercial Game Engines

July 31, 2015, 3:00 am

Submitted by: Neal Pierman (Intel), February 2, 2015

Abstract

Game development is no easy task. Developers have to contend not only with the ever-shrinking “shelf life” of products across platforms, but also with the need to support multiple OS versions. Optimizing a game even for a single platform is a fairly complex process, especially given the growing complexity of systems and the need to account for device power consumption. But with billions of Windows* and Android* devices in use, the potential profit is exceptionally large.

In this article we describe how Intel® Integrated Native Development Experience (Intel® INDE), a cross-platform tool suite released last year, can help you quickly and easily create world-class, high-performance games on devices running Windows* and Android*. These tools are very useful even when you use third-party game engines such as Unity* or Unreal Engine* from Epic. Intel INDE can help your game stand out in the market through additional benefits and capabilities.

The end result of using Intel INDE is great games that are attractive and interesting to buyers.

Introduction

Game developers need tools that speed up bringing games to market for an ever-expanding set of platforms. Many developers use commercial game engines because they significantly speed up project creation and provide access to a broad user base across platforms. But without additional tools beyond the game engine, the game's performance may fall short. Extra performance can yield a higher and more consistent frame rate, more realistic terrain rendering, or even double (!) the number of zombies; in other words, it can make the game more appealing regardless of the target platform. In a fiercely competitive market your product has to stand out, and being the first to ship a good game can be the key to your company's success.

This article is aimed primarily at users of commercial game engines, but Intel INDE tools are also useful to developers who build and use their own engines. That, however, will be the subject of another article in this series (don't miss it).

Using Intel INDE with a Game Engine

If you use the Unity* game engine or Epic's* Unreal Engine*, it may seem that the engine alone is enough and that other products, such as Intel INDE, are unnecessary. In particular, many developers expect the game engine to be an all-in-one tool: all you need to do is create the game assets and make sure enough zombies appear on screen in every frame.

Fortunately, Intel engineers have long worked with leading game developers to optimize the game engines you use with a product included in Intel INDE, the Intel C++ compiler. The compiler makes it possible to optimize engines for different platforms, while the other analysis and optimization tools in Intel INDE handle comprehensive optimization of the whole game engine for the highest performance. In particular, thanks to Intel's partnership with Unity and Epic, these game engines deliver excellent performance regardless of the specific Intel target platform, whether Windows or Android.

But even developers who use these game engines will find useful tools in Intel INDE for speeding up games and achieving higher performance than with the game engine alone. These tools, previously released as part of the Intel Graphics Performance Analyzers (Intel GPA) product, are now available only in Intel INDE. They are intended for:

debugging game assets;
performance analysis and optimization;
power consumption analysis.

Intel INDE optimization tools are very convenient for debugging game assets. A look through the Unity user forums shows that many users use Intel INDE tools alongside their development tools. A common example: capture a frame for detailed analysis, then use Graphics Frame Debugger (currently Android only) to step through the scene one draw call at a time. At each step you can examine every visual aspect and property of the objects in detail: rotate them in real time to find misplaced vertices, view the model's wireframe to check for level-of-detail problems, inspect the object's graphics properties, and view the frame buffer and depth buffer side by side. For example, if a zombie does not appear in the right place on screen, you can check the depth buffer and discover that the zombie is actually being drawn behind the barn rather than in front of it.

For performance analysis, plan achievable goals in advance that balance performance against visual quality, and then check progress toward those goals throughout development. Tools such as Intel INDE System Analyzer, Graphics Frame Analyzer, and Platform Analyzer provide valuable analysis and optimization capabilities for this. This article discusses strategies for choosing various performance settings and assessing their impact.

As noted above, you need to keep making sure your game offers the best usability and performance on the target platforms. The only way to stand out from competitors is to create a more appealing game, and fitting in all the features users want while maintaining the required frame rate takes a high level of optimization. In addition, because Intel works with game engine developers, many engines already include profiling data that Intel INDE tools can process when the game is run. You can then replay the trace file in Platform Analyzer and visually examine how threads interact with the CPU and GPU to determine where the performance bottleneck is.

For mobile platforms, power consumption analysis is one of the most important factors. The latest Intel systems use unified power management for the CPU and GPU, so drawing too much power can cause the CPU or GPU to slow down, which hurts the game's interactivity. Run Intel INDE System Analyzer to analyze power usage: does power consumption spike in certain scenes of the game? If so, analyze the game assets and other key parameters and settings to find out what is happening and why.

So do not rely on the game engine alone for everything. Intel INDE can prove to be a very useful tool and can help you stand out from competitors. The tools in Intel INDE help you determine which of the game's features deliver the greatest effect.

Next Steps…

Of course, to start using the product you need to get it. For details on Intel INDE, see the product home page. It also lists the differences between editions of the product; you can download the free Starter Edition or a free trial of the Ultimate Edition.

At the Game Developers Conference (GDC) in San Francisco in March, this information was presented in various Intel presentations (some held jointly with leading game development companies). In addition, the Intel INDE booth covered the latest versions of the various performance analysis and optimization tools.

Watch for other articles in this series, which explain in more detail how Intel INDE helps you develop great games quickly and easily. In particular, look for the articles showing how Intel INDE helps developers who want to use their own game engines. Intel INDE has many additional capabilities and benefits that help you create great games quickly and without much difficulty!

For more information on compiler optimizations, see our Optimization Notice.

↧

Meshcentral - Sprinxle Cloud

August 3, 2015, 8:35 am

As many of you know, Meshcentral is an open source remote management web site that allows you to remotely monitor and control a wide range of computers (Windows, OSX, Linux, Android, ChromeOS...) over the Internet, with support for Intel® Active Management Technology (Intel® AMT) when present. Since the web site is open source under the Apache 2.0 license, anyone can set up their own Mesh server. In fact, we built an installer that automates the task. For a full Internet setup, you still need a server with an external DNS name and IP address, along with a few other requirements. The nice thing is that you get your own server; on the flip side, you don't have a team of people managing the server and providing support. This is where the people at Sprinxle come in.

Sprinxle is a company based in California that offers Intel AMT management solutions and services. Recently, they added Sprinxle Cloud to their product lineup. Sprinxle Cloud is a commercially supported version of Meshcentral. Late last year, they started taking the open source software and making the changes needed to offer it as a supported product. They offer e-mail and phone support along with many other services.

So, if you like Meshcentral's features and want something that is backed by professionals, check them out at http://sprinxle.com/quickstart to get started. I have not used the service myself, but if you have, please let me know what you think.

Enjoy!
Ylian

↧

New, Exciting Media Transcoding Software Capabilities to Showcase at IBC 2015

August 4, 2015, 10:30 am

See You at the International Broadcasting Convention (IBC)

Amsterdam | Sept. 11 to 15

Visit Intel at IBC in Amsterdam, Sept. 11 to 15, to see demos of new media software capabilities, some of which are so special that we can’t even list them! With consumer demand exploding for video content and ultra HD TVs, media and video solution providers need to innovate now for HEVC, 4K, and UHD to stay competitive. Intel hardware and software make these transitions much easier, faster, and more powerful.

Learn how Intel® Media Server Studio, Intel® Video Pro Analyzer, and Intel® Stress Bitstreams and Encoder can help.

Register with free passcode 6451 for Exhibit Hall entry.
And see us in Hall 4, #B72.

↧

The Developer Who Wants to Write Code to Change the World

August 4, 2015, 1:30 pm

Ngesa Marvin knew from an early age that he wanted to work in science and engineering. He grew up in a village in Muhoroni (Kenya), and his first hard lesson came when he got an electric shock while trying to pull wires out of a meter to build a car. A few years later, at his uncle's house, he found a book that explained how to write HTML code. After some experimenting and encouraging results, he got a taste for it and knew that he wanted to develop solutions to problems specific to the African continent, change other people's lives, and make the world a better place.

His most recent project, Unicomm, is a motion algorithm that interprets hand gestures, reads them as sign language, and outputs them as text or audio, making communication easier for people with hearing impairments.

“Code can solve many problems in Africa, from unemployment and disease to corruption and the need to automate farming and livestock raising. There are no limits; with the right education and encouragement, innovations that solve today's problems will spring up from every corner of the continent.”

Ngesa studies telecommunications and information engineering, and although he is largely self-taught when it comes to coding, he admits that not everything has gone smoothly. He ran into many errors while learning to write code in HTML, PHP, and MySQL, but he got through the moments of frustration, did not give up, and learned from his mistakes until he had solved the problems. “When the code works, the work becomes fun, and when you enjoy it, you understand it.”

Nothing Is Impossible

Ngesa firmly believes that if you want something badly enough, you will do everything you can to achieve it. While he acknowledges the difficulties developers in Africa face, such as slow Internet connections and limited access to hardware, he saw how important it was to join local technology groups and work with other developers on their projects. Being surrounded by passionate, curious people encouraged him to hone his skills and learn from others whenever he could. “Having someone explain things to you is much more effective than reading.”

He suggests that those interested in developing applications look beyond their immediate surroundings, take part in online groups, and spend more time practicing. His advice is to be patient, make mistakes, and keep trying and learning. According to Ngesa, one of the worst mistakes inexperienced developers make is being too hard on themselves and giving up too quickly. “Most people fall by the wayside for lack of confidence. We should start with modest projects and improve them day by day. The essential thing is to find what you are passionate about and devote yourself to it. Nothing in life is more rewarding than pursuing your own dreams and working hard until you reach your goals.”

The Best Tools for the Job

Ngesa is an enthusiast of Intel architecture and encourages all new developers to write code with Intel XDK, which lets them build and deploy applications for multiple operating systems and change their requirements and target devices to reach more users. He describes the Intel Galileo board as “one of the best development boards for all electronics hobbyists,” because it improves their software and electronics skills and so lets them make fabulous things for smart wearables.

Ngesa is very excited about the Internet of Things revolution and wants to be one of those who lay the foundations for a future of automation and connectivity. He also wants to promote open source electronics. “We are entering an era of makers. We need to know our full potential and look for practical solutions to African problems. We are the ones who best understand our difficulties and are therefore the best equipped to attack the root of the problems and have a positive effect on the lives of a great many people. I am convinced that solutions for overcoming everyday obstacles should be devised for Africans by Africans.”

To follow Ngesa's projects, click here.

Download Intel XDK for free today and start working on innovations that will change everyone's lives.